In this lecture, we will discuss some of the basic mechanisms by which nodes in a distributed
system communicate. We are talking about very large distributed systems. In such systems, a node
often knows only the IDs of some of the nearby nodes; it does not know which nodes comprise the
entire network. Communication between nodes in such cases is tricky: given that we do not know
all the nodes in the network, we will not know whether a message is reaching all of them or not.

So, what we will discuss in this video are basically two kinds of protocols: Epidemic Protocols
and Gossip Based Protocols. Epidemic Protocols, such as Anti-Entropy and Rumor Mongering, are
clearly more popular in the literature. In comparison, Gossip Based Protocols are not that
popular, but they are nevertheless also used. We will see that these protocols are heavily used in
a lot of distributed systems, particularly in modern, large web-scale systems that are distributed
all over the internet.

Overlay Network

It is an application-level network that is independent of the underlying network topology.

• The most common overlay networks are a ring and a star.

So, the key idea of a distributed system is to form an overlay network. An overlay network is
basically a virtual network, essentially a network over a network. The physical organization of
the network can vary: there could be wireless links, there could be Ethernet links, there is a wide
variety of possibilities.

Regardless of that, we create a virtual network, and that is known as the overlay. As you can see,
this is a ring-shaped overlay, where each node just knows the ID of its left neighbor and the ID of
its right neighbor, essentially its clockwise neighbor and its anticlockwise neighbor.

We shall see that such overlays form the foundation of many of today's peer-to-peer networks, and
most of our enterprise networks use such overlays. The most common ones are clearly a ring and a
star. These are structured networks.
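
As a small illustration of the ring-shaped overlay described above, here is a minimal sketch in Python. The function name `ring_neighbors` and the choice of ordering the ring by sorted node ID are assumptions made only for this example; the lecture does not prescribe how positions on the ring are assigned.

```python
def ring_neighbors(node_ids):
    """Map each node ID to its (anticlockwise, clockwise) neighbor on a ring overlay."""
    # Assumption: a node's position on the ring is given by sorted ID order.
    ring = sorted(node_ids)
    n = len(ring)
    return {
        ring[i]: (ring[(i - 1) % n], ring[(i + 1) % n])
        for i in range(n)
    }


overlay = ring_neighbors([42, 7, 19, 88])
print(overlay[7])   # (88, 19): node 7 only knows these two IDs
```

Note that the overlay says nothing about the physical links: nodes 7 and 88 may be on different continents and still be neighbors on the ring.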

Ring and Star

• A star is a centralized configuration.
  • The central node is typically a server.
  • The rest of the nodes are clients.
• A ring is the basis of a structure called a DHT (distributed hash table).
  • We will study ring topologies in third generation peer to peer networks.

• Let us first focus on unstructured overlay networks.
  • There is no fixed global topology.
  • A node typically only knows a subset of other nodes.

So, the definition of a star is something that all of you know, but it might not be a bad idea to
define it nevertheless. We have one central node, which can be a server of some sort, and many
other client nodes connected to it. Typical client-server computing, where we have multiple
clients and a single server, would use such a star-shaped overlay.

Why do I call it an overlay? Because this is not the real organization of the network; it is a
virtual organization, and that is why I am using the term overlay. A ring, in comparison, is way
more popular. A ring is the basis of a structure called a distributed hash table, or DHT.

DHTs started coming in, in a big way, with third generation peer-to-peer networks. Here we can
have a large system spread over the internet, but regardless of where the nodes are, we just
connect them as a virtual circle, a virtual ring. Star-shaped overlays and ring-shaped overlays
form the basis of what we call structured networks.

But our discussion here centers on unstructured overlay networks, where there is no fixed global
topology. A node typically knows only a subset of the other nodes, so a node is pretty much blind.
Somebody can argue that in a ring-shaped overlay also a node is pretty much blind, in the sense
that it only knows about its clockwise successor and its anticlockwise successor.

But at least there is a structure. In an unstructured overlay network, there is no global
structure; you only have a small local structure, if there is one at all. Sometimes you just know
a list of a few nodes, and that is it.

So, when we are talking of such unstructured networks, there is a problem. If I want to send a
message, or multicast a message to a group of nodes, I may not be able to reach the entire
network, because I am not aware of the state of the entire network.

This problem is not there with structured networks such as a ring or a star, but we are mainly
looking at unstructured networks here. What I can do is send the message to all my neighbors, and
ask the neighbors to forward it to their neighbors, and so on.

This will flood the network exponentially. The negative aspect is that it creates an exponential
number of messages, which becomes almost impossible to handle, and there is still no guarantee
that the entire network of nodes will actually be reached. So, we need some way of roping this in;
the purely exponential way is a bad way.

Furthermore, we need some mathematical techniques for analyzing this setup.
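
To get a feel for the message blow-up described above, here is a rough sketch of naive flooding on a random unstructured overlay. Everything in it is an assumption made for illustration: the helper names `random_overlay` and `naive_flood`, the fixed number of known neighbors per node, and the hop limit `ttl`. Nodes do not remember what they have already seen, which is the "pure" flooding that the discussion warns against.

```python
import random


def random_overlay(n, degree, rng):
    """Each node knows `degree` randomly chosen other nodes (no global structure)."""
    return {u: rng.sample([v for v in range(n) if v != u], degree) for u in range(n)}


def naive_flood(neighbors, source, ttl):
    """Forward to every known neighbor for `ttl` hops, with no duplicate suppression.

    Returns (set of nodes reached, total number of messages sent).
    """
    frontier = [source]          # one entry per message copy currently in flight
    reached = {source}
    messages = 0
    for _ in range(ttl):
        next_frontier = []
        for u in frontier:
            for v in neighbors[u]:
                messages += 1    # counted even if v already has the message
                reached.add(v)
                next_frontier.append(v)
        frontier = next_frontier
    return reached, messages


rng = random.Random(0)
overlay = random_overlay(50, 3, rng)
reached, messages = naive_flood(overlay, source=0, ttl=5)
print(messages)                  # 363 = 3 + 9 + 27 + 81 + 243
print(len(reached), "of 50 nodes reached")
```

With 3 known neighbors per node, the message count grows as 3 + 9 + 27 + ..., far beyond the 50 nodes in the network, and the set of reached nodes may still be incomplete.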

Epidemic Algorithms for Replicated Database Maintenance by Alan Demers, Dan Greene, Carl Hauser,
Wes Irish, John Larson, Scott Shenker, Howard Sturgis, Dan Swinehart, and Doug Terry (PODC 1987)

• Problem: Propagate updates to a large set of databases in Xerox's corporate intranet
  (unstructured).
• Updates are injected at one site and propagated to the rest of the sites.

So, what I will discuss at the beginning is this paper, Epidemic Algorithms for Replicated
Database Maintenance, by, as you can see, a long list of authors, published in PODC 1987. What was
the basic problem that was being solved? There used to be a big company called Xerox.

We do not get to hear of Xerox a lot in 2020, but at one point Xerox was a preeminent power in
computing technology. For example, a large part of the graphical windows that you see came from
Xerox, and so did much of the early work on the mouse. The Xerox PARC research lab used to be
counted as maybe the world's best research lab in this area.

Xerox and IBM were competitors. Xerox had a large corporate intranet, which did not have a
structure; it was unstructured. They were maintaining databases on it, and these databases needed
to be kept in sync, which meant that updates had to be propagated. The question was how to
propagate updates given that there were so many machines.

In those days, network communication was expensive, at least in terms of time, so this was a
problem. Updates are typically injected at one site, and they have to be propagated to the rest of
the sites. Given that a site did not have a global list, this became a problem.

So, there are three main approaches that we should look at. The first is Direct Mail, where an
update is sent from one site to all the other sites. Direct mail could not be used because, number
one, there were too many sites and their exact list was not known. Furthermore, it is a sequential
process, not a parallel one: it takes a lot of time to send the messages, and we are dealing with
a slow intranet here. That is why the two broad approaches that were used are Anti-Entropy and
Rumor Mongering.

These approaches were later worked on by a generation of researchers. Today there are very
sophisticated Anti-Entropy and Rumor Mongering methods, which are actually used by many of the
companies that affect our lives, such as Amazon, Facebook, LinkedIn, and so on. Of course, they
are not used in the same way that Xerox used them, but in modified and improved versions suited to
the systems of those companies.

Anti-Entropy says that we choose a site at random, and we synchronize the contents of the two
databases by exchanging contents. That is, I randomly choose a site, I give it my updates, and I
take the updates of that site and update my own database. Whatever I know that is new is
transferred to the other site, and whatever the other site knows that I do not know is transferred
to me. So, our knowledge levels become equal. This is Anti-Entropy.

Anti-Entropy, by definition, does not stop, because we keep on choosing sites at random and
exchanging information with them. The other approach is Rumor Mongering, where a site distributes
updates to other sites. When a site sees that most of its neighbors already have the update, it
reduces the frequency at which it propagates the update. We say that this is like spreading a
rumor; the rumor in this case is the update. The rumor ceases to be hot, and gradually it fades
away.

• infective: Already received the update, and willing to propagate.
• susceptible: Has not received the update.
• removed: Not participating in propagating updates.

• Anti-Entropy
  • Simple epidemic

So, we will do a little bit of mathematical analysis over here. We define three kinds of sites:
infective, susceptible and removed. An infective site is one that has already received the update
and is willing to propagate it. Incidentally, these terms have been taken from the literature of
epidemiology, which studies how epidemics propagate.

Given that I am recording this video in the middle of the coronavirus pandemic, these terms will
appear to be extremely relevant. An infective is essentially a person, in this case a site, that
is willing to propagate. A susceptible is a site that has not received the update, so it can be
infected later. Removed means that you are not participating in propagating updates; in a sense,
you have disconnected yourself from the network as far as updates are concerned.

So those are the terms. In anti-entropy, it takes longer to propagate updates as compared to
direct mail, and there is no built-in termination mechanism. But it is what is called a simple
epidemic, in the sense that it is easy to propagate information; it is nice and simple. In
comparison, rumor mongering is a complex epidemic, because it has a built-in termination
mechanism.

But unlike anti-entropy, there is a problem. Anti-entropy guarantees that ultimately all the sites
will have received the update. In rumor mongering, there is a chance that updates do not reach a
node; that is, there is some probability that the updates might not reach a certain node. A node
in this case is a site.

So, let us start with Anti-Entropy. Let us say that the network contains S sites, and consider the
database copy K at a given site s. Its value is essentially a tuple with two parts: v is the
actual value of the update and t is the timestamp. Any newer update will have a higher timestamp,
and this is how we differentiate between an old update and a new update.

It was necessary to add this timestamp because messages that carry older updates might still be
alive in the network. Old updates should be discarded and new updates are the ones that should be
applied, and this is decided on the basis of the timestamp.


(Refer Slide Time: 


Anti-Entropy

Algorithm 1: Anti-entropy algorithm

ResolveDifference-push(s, s') {
    if s.valueOf.t > s'.valueOf.t then
        s'.valueOf ← s.valueOf
    end
}

ResolveDifference-pull(s, s') {
    if s.valueOf.t < s'.valueOf.t then
        s.valueOf ← s'.valueOf
    end
}

ResolveDifference-push-pull(s, s') {
    ResolveDifference-pull(s, s')
    ResolveDifference-push(s, s')
}

So, the anti-entropy algorithm per se is easy. We have two kinds of anti-entropy: push and pull.
There are two sites, s and s'. What push does is push the updates of s to s': if s has a higher
timestamp, it sets the value of s' to the value of s. The difference in a pull is that s
essentially sends a pull message to s' and asks it for the most recent copy of the data that it
has.

So the pull does the same thing in the other direction: if s has a lower timestamp than s', then
the value of s is set equal to the value of s'. It is important to look at this once again. In
push, s is pushing to s'. In pull, s is essentially asking s', look, do you have any data that
might be of potential use to me? If s has a lower timestamp, it pulls the data from s', that is, s
sets the value of its data to the value of s', since the data of s' has the more recent timestamp.

Let us temporarily look at this figure. In a pull, we have two sites s and s'. s sends a pull
message saying, do you have any updates for me? If it does, s' sends the update. s compares the
timestamps, and if it has the smaller timestamp, it takes up the update of s'.

So, what are the semantics of push and pull? In push, s pushes its updates to s'. In pull, s asks
s' whether it has a new update; if s' has one, it sends it to s, and if it is new from the point
of view of s, s applies it. Push-pull is a combination of both schemes, push as well as pull, and
that is the reason I am not describing it separately.
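
The three ResolveDifference procedures translate almost line by line into runnable code. The following is only a sketch: the `Site` class, its field names, and the use of integer timestamps are assumptions made for this example, and a real database copy would hold many entries rather than a single value.

```python
from dataclasses import dataclass


@dataclass
class Site:
    value: str       # v: the actual value of the update
    timestamp: int   # t: a larger timestamp means a newer update


def resolve_push(s, s_prime):
    """s pushes its copy to s' if the copy of s is newer."""
    if s.timestamp > s_prime.timestamp:
        s_prime.value, s_prime.timestamp = s.value, s.timestamp


def resolve_pull(s, s_prime):
    """s pulls the copy of s' if the copy of s' is newer."""
    if s.timestamp < s_prime.timestamp:
        s.value, s.timestamp = s_prime.value, s_prime.timestamp


def resolve_push_pull(s, s_prime):
    """Pull first, then push; afterwards both sites hold the newer copy."""
    resolve_pull(s, s_prime)
    resolve_push(s, s_prime)


a, b = Site("old", 1), Site("new", 5)
resolve_push(a, b)    # no effect: a holds the older copy
resolve_pull(a, b)    # a takes the copy of b
print(a)              # Site(value='new', timestamp=5)
```

In every case the stale copy is the one that is overwritten, which is exactly what the timestamp comparison is for.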

• Anti-entropy distributes updates in $O(\log n)$ time (see results from epidemic theory).
• Pull-based algorithms
  • Let $p_i$ be the probability of a site remaining susceptible after the $i^{\text{th}}$ cycle.
  • $p_{i+1} = p_i^2$
• Push-based algorithms
  • Expected number of infective nodes: $n(1 - p_i)$
  • Probability of not contacting node X: $1 - \frac{1}{n}$
  • $p_{i+1} = p_i \left(1 - \frac{1}{n}\right)^{n(1 - p_i)}$
  • Now, $\left(1 - \frac{1}{x}\right)^x$ tends to $\frac{1}{e}$ as $x \to \infty$
  • Thus, for large $n$ and small $p_i$: $p_{i+1} = p_i e^{-1}$

So, let us now do a little bit of math and understand what exactly is going on here. What we are
out to show is that anti-entropy distributes updates in $O(\log n)$ time. This is a result from
epidemic theory; try to apply it to a coronavirus scenario. In this case, $n$ is the total number
of nodes.

In a pull-based algorithm, let $p_i$ be the probability of a site remaining susceptible after the
$i^{\text{th}}$ cycle. So, for a given site, the probability that it is still susceptible after
the $i^{\text{th}}$ cycle is $p_i$. What is the probability that it is susceptible even after the
$(i+1)^{\text{th}}$ cycle? We are dividing time into rounds. In every round, the node contacts
some other node, because it is a pull-based system, and if the other node has a recent update, it
gets it. So the question is a conditional probability: what is the probability that the other site
does not contain the update? If it does not contain the update, then, since we do not remove nodes
in anti-entropy, our node will still be susceptible.

How do I write the probability? The probability that I am susceptible after the
$(i+1)^{\text{th}}$ cycle is the probability that I am susceptible after the $i^{\text{th}}$
cycle, multiplied by the probability of the event $y$: given that I was susceptible after the
$i^{\text{th}}$ cycle, I contact a node that is itself susceptible, that is, not infected. The
first probability is clearly $p_i$, by definition. The second is the probability that the node I
contact belongs to the subset of nodes that are not infected, and that is again equal to $p_i$.
This arises from the definition of probability as the number of points of interest divided by the
size of the sample space: here, the number of nodes that are still susceptible divided by the
total number of nodes is $p_i$. That is where the second $p_i$ comes from. If I multiply them, I
get $p_i^2$, so $p_{i+1} = p_i^2$.

If you are not able to understand this explanation, I would suggest rather strongly that you look
at a text on probability, which will clear up most of your doubts. It will tell you why we get
$p_i^2$ here: it is just the product of the probability that the node is susceptible after the
$i^{\text{th}}$ cycle and the probability that, if so, it remains susceptible after the
$(i+1)^{\text{th}}$ cycle, and both are equal to $p_i$.

This holds for a pull-based algorithm, not for a push-based one. The push-based analysis relies on
the law of large numbers; this is the right time to go to Wikipedia and study it. The expected
number of infective nodes, the nodes that are not susceptible, is $n(1 - p_i)$: the probability
that a node is infective is $1 - p_i$, and the total expected number is $n$ times this. This
follows from the law of large numbers, the weak law in particular. The probability that a given
infective node does not contact me is $1 - \frac{1}{n}$.

So, what is the probability that a given node is susceptible after the $(i+1)^{\text{th}}$ cycle?
Again, it is the probability that the node is susceptible after the $i^{\text{th}}$ cycle,
multiplied by the probability that no infective node contacts it in the next round. That is
important. How many such nodes are there? There are $n(1 - p_i)$ of them, and none of them
contacted me. For each node, the probability of not contacting me is $1 - \frac{1}{n}$, so the
probability that none of them does is $\left(1 - \frac{1}{n}\right)^{n(1 - p_i)}$. This quantity
is the probability that I remain susceptible through the round, given that I was susceptible at
its beginning. So $p_{i+1} = p_i \left(1 - \frac{1}{n}\right)^{n(1 - p_i)}$.

We will use a well-known result from calculus: $\left(1 - \frac{1}{x}\right)^x \to \frac{1}{e} =
e^{-1}$ as $x \to \infty$. The assumption that I am making here is that $p_i$ is rather small,
that is, the probability that a node is still susceptible is small, so the exponent $n(1 - p_i)$
is approximately $n$. Then we can approximate the second factor by $e^{-1}$. So, for large $n$ and
small $p_i$, we have $p_{i+1} \approx p_i e^{-1}$: the probability of susceptibility falls by a
factor of $e$ every round. You would have encountered $\frac{1}{e}$ in many other scenarios; it is
around 0.37.

So, roughly, in every round the probability of susceptibility falls to about 37% of its previous
value. After a few rounds it becomes very small, and it continues to decrease. But mind you, this
approximation is only valid when $p_i$ is small.
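
To keep the two results side by side, here is a compact restatement of the argument above. The labels "pull" and "push" on $p_{i+1}$ are added here only for readability and are not part of the original notation.

```latex
\begin{align*}
p_{i+1}^{\text{pull}}
  &= \underbrace{p_i}_{\text{susceptible after cycle } i}
     \cdot \underbrace{p_i}_{\text{contacted node is susceptible}}
   = p_i^2 \\[4pt]
p_{i+1}^{\text{push}}
  &= p_i \left(1 - \frac{1}{n}\right)^{n(1 - p_i)}
   \approx p_i \, e^{-(1 - p_i)}
   \approx p_i \, e^{-1}
   \qquad (\text{large } n,\ \text{small } p_i)
\end{align*}
```

As a quick numerical check, with $p_i = 0.01$ the pull recurrence gives $p_{i+1} = 10^{-4}$, while the push recurrence gives roughly $0.0037$.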

So, let us look at these two expressions. If $p_i$ is small enough, the pull expression reduces as
a square: given that these are probabilities, they are fractions, so the value gets squared, then
raised to the power 4, then 8, and so on. The push expression, in contrast, reduces by a constant
factor. In both cases the probability of susceptibility should ideally become 0 at the end; with
pull it tends to 0 by squaring, and with push by a constant factor.

We now need to understand which one is better. It all depends on whether we are at the beginning
or at the end. Let us say this is a big network. At the beginning, if we use a pull-based method,
hardly anybody is going to get anything, because there are very few infective sites: hardly any
node is going to contact one of the few sites that have the update. At the beginning, push-based
methods are clearly better, because they at least reduce the probability by a constant factor. Of
course, we have assumed a small $p_i$, but even if you were to do a simulation, you would still
see it reducing quite significantly and quite quickly as compared to pull-based methods.

So, at the beginning, a push-based method is definitely preferred. What the push-based method
actually does is propagate the update to as many nodes as it can. Gradually, as the update gets
disseminated, there will still be a small minority of nodes that have not gotten the update.

They can then revert to a pull-based method, where they contact nodes that have received the
update, that is, infective nodes, and fetch the update from there. That is why we say that
pull-based methods are better at the end.
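
The end-game behavior can be checked numerically by simply iterating the two recurrences, $p_{i+1} = p_i^2$ for pull and $p_{i+1} = p_i (1 - 1/n)^{n(1 - p_i)}$ for push. The sketch below is only that: the function names, the starting point of 10% susceptible nodes, and the stopping threshold are assumptions chosen for illustration, and it iterates the expected-value formulas rather than simulating an actual network.

```python
def pull_step(p):
    """One pull round: stay susceptible only if the contacted node is susceptible."""
    return p * p


def push_step(p, n):
    """One push round: stay susceptible only if none of the n(1 - p) infective nodes calls."""
    return p * (1 - 1 / n) ** (n * (1 - p))


def rounds_until(step, p0, threshold):
    """Number of rounds until the susceptible probability drops to `threshold` or below."""
    p, rounds = p0, 0
    while p > threshold:
        p = step(p)
        rounds += 1
    return rounds


n = 1000
pull_rounds = rounds_until(pull_step, 0.1, 1e-6)
push_rounds = rounds_until(lambda p: push_step(p, n), 0.1, 1e-6)
print(pull_rounds)   # 3: 0.1 -> 0.01 -> 1e-4 -> 1e-8
print(push_rounds)   # noticeably more rounds, since each one only gains a factor of about e
```

Starting from a small susceptible minority, squaring wins easily, which is the quantitative content of "pull is better at the end".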

• Push based methods are better at the beginning.
• Towards the end, pull based methods are better.
• Instead of comparing entire database contents, we can do better:
  • First compare recent entries (less than r seconds old).
  • If they match, then nothing needs to be done.
  • If they do not match, then update recent entries, and compare checksums of the rest of the
    database.
  • If the checksums do not match, then sync the databases.

So, what was the conclusion from the math? At the beginning, push-based is better, because you are
just coming into the network with an update: contact as many nodes as you can and spread the
update. However, towards the end, pull-based methods are better. Even if we try to push, with high
probability I will not be able to contact the nodes that have not received the update.

But in pull-based methods, the nodes that have not received the update will continue trying to
fetch it. As they become a minority, the probability of them actually hitting a node that has the
update increases. That is why a good method uses push at the beginning and pull at the end. So,
this is anti-entropy for you.

It can be optimized. Instead of comparing the entire database contents, first compare the
timestamps of the recent entries. If the timestamps match, nothing much needs to be done, because
that tells you that the earlier entries are there as well. If they do not match, then update the
recent entries and compare what are called checksums. Think of checksums as advanced versions of
parity bits: if you have a 64-bit checksum, it more or less uniquely identifies the contents of
the entire database.

So, the contents of the entire database are compressed to a 64-bit number; it is a hash. If the
checksums do not match, which means the hashes do not match, there is at least one small
difference somewhere, and there is a need to transfer the entire databases and sync them.
Otherwise, there is no need. So, we compare hashes for the older entries, and for the very recent
entries we compare the entries themselves.
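
The optimized exchange can be sketched as follows. This is a hedged illustration, not the scheme used at Xerox: the dictionary layout `key -> (value, timestamp)`, the helper names, and the use of a truncated SHA-256 digest as the 64-bit checksum are all assumptions made for this example. For simplicity it checksums the whole database after the recent entries have been exchanged, which at that point differs between the sites only if the older entries differ.

```python
import hashlib


def checksum64(db):
    """64-bit checksum of a database (assumption: first 8 bytes of a SHA-256 digest)."""
    h = hashlib.sha256()
    for key in sorted(db):
        h.update(repr((key, db[key])).encode())
    return h.digest()[:8]


def recent(db, now, r):
    """Entries less than r seconds old; db maps key -> (value, timestamp)."""
    return {k: vt for k, vt in db.items() if now - vt[1] < r}


def merge_newer(dst, src):
    """Copy into dst every entry of src that is missing or has a newer timestamp."""
    for k, (v, t) in src.items():
        if k not in dst or dst[k][1] < t:
            dst[k] = (v, t)


def sync(db_a, db_b, now, r):
    """Returns which step settled the exchange: 'none', 'recent' or 'full'."""
    ra, rb = recent(db_a, now, r), recent(db_b, now, r)
    if ra == rb:
        return "none"                    # recent entries match: nothing to do
    merge_newer(db_a, rb)                # update the recent entries on both sides
    merge_newer(db_b, ra)
    if checksum64(db_a) == checksum64(db_b):
        return "recent"                  # older entries agree: no full transfer
    merge_newer(db_a, db_b)              # checksums differ: sync the full databases
    merge_newer(db_b, db_a)
    return "full"


a = {"x": ("1", 10), "z": ("9", 100)}
b = {"x": ("1", 10)}
print(sync(a, b, now=105, r=10))         # recent
```

The point of the design is that the expensive full comparison is attempted only after two cheap checks have failed.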


(Refer Slide Time: 32:24) 


Outline

• Epidemic Protocols
• Rumor Mongering


Smruti R, Sarangi Communication between Nodes 


So, we have looked at anti-entropy. We have looked at the push-based mechanism and at the
pull-based mechanism, and we have said that push is better at the beginning and pull is better at
the end. Now, we look at rumor mongering, which is a second kind of epidemic protocol. Why
epidemic protocol? Because these protocols have come out of epidemic theory, something that you
see a great deal of in these COVID pandemic days.

Terminology

• s: fraction of nodes that are susceptible
• i: fraction of nodes that are infective
• r: fraction of nodes that are removed

Governing Equations

• The sum of the fractions is 1: $s + i + r = 1$




So, here also we will have the same idea of susceptible, infective and removed. In this case, we
consider not the number of nodes but rather the fraction, and that is why we have
$s + i + r = 1$. Recall that the removed category was not there in anti-entropy, but it is there
in rumor mongering.

So, we will use some calculus here to derive the results. Let us look at the rate of decrease of
susceptible nodes. This is $\frac{ds}{dt}$. It will have a negative sign because $\frac{ds}{dt}$
is, strictly speaking, the rate of increase; the negative sign makes it the rate of decrease.

So, this rate is proportional to two quantities. It is proportional to the fraction of susceptible
nodes, because if there are more susceptible nodes, then more nodes will get contacted, more nodes
will get infected, and more nodes will enter the infective category. It is also proportional to
the fraction of infective nodes: the higher the fraction of infective nodes, the proportionately
higher is the rate of decrease of susceptible nodes.

These equations hold, let us say, in a steady-state range, where the number of infective nodes is
not so large that it is saturating, and the number of susceptible nodes is neither extremely large
nor extremely small. So, there are ranges where this equation holds well and ranges where it does
not hold all that well. Let us say that these are empirical laws, in the sense that they have been
observed to hold most of the time. They can also be derived from the simple informal explanation
that I gave: the rate of decrease of the susceptible nodes is proportional to both quantities, and
thus we consider their product, $\frac{ds}{dt} = -si$.


Furthermore, nodes lose interest in propagating rumors with a probabilistic factor of
$\frac{1}{k}$, which means that nodes will gradually lose interest in propagating the infection,
in this case the rumor, at the rate $\frac{1}{k}$. The second equation is derived like this.
$\frac{di}{dt}$ is the rate of increase of infective nodes. For reasons similar to the previous
one, it is proportional to $si$, the product of the fractions of susceptible and infective nodes.

However, there is a damping factor. Let us understand this damping factor. It is proportional to
the fraction of infective nodes, which should be the case, because the higher the number of
infective nodes, the higher the damping should be. It is also proportional to the fraction of
nodes that are not susceptible. If we go back to the equation $s + i + r = 1$, the nodes that are
not susceptible are either infective or removed, so this fraction is $(1 - s)$. So $(1 - s)$ is
the non-susceptible part of the network, and as it increases it puts in a damping effect. Then
there is the factor $\frac{1}{k}$ that we talked about; $k$ determines how quickly the rumor dies,
how quickly it fades away. Putting this together, $\frac{di}{dt} = si - \frac{1}{k}(1 - s)i$.

As I have said, the equations have been derived from a combination of two things. The first is
empirical observation: this approximate relationship does tend to hold for a large range of $i$
and $s$. The second is the informal arguments that I have provided up till now.


Well, informal is a bad word, let me call it the semi-formal arguments that have been provided to 
you. so, they basically look at the interaction of these two factors. And so, the interaction of these 
two factors pretty much determines how the numerator is going to propagate. So, if I were to solve 
both of these equations, which would be this equation over here, and this equation over here, if I 
were to solve them, then I can always find the rate of infective nodes, the fraction of infective 


nodes as a function of s. 


So, it turns out that it is a sum of a linear function and a log function, with the damping factor playing a very key role, as you can see over here. So, what are we getting to see over here? We have eliminated t, and we basically have i as a function of s. And of course, we can see that the damping factor plays a key role. The derivation of this is clearly given in the paper, and you can take a look at it. This is just a brief summary that I am showing.
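For reference, the elimination of t can be written out as follows. This is only a sketch: it assumes that the two rate equations are the usual rumor-mongering pair, $ds/dt = -si$ and $di/dt = si - \frac{1}{k}(1-s)i$, and that $i(1) = 0$, that is, essentially nobody is infective at the very start. The paper remains the authoritative derivation.

```latex
\begin{align*}
\frac{di}{ds} &= \frac{di/dt}{ds/dt}
   = \frac{si - \frac{1}{k}(1-s)\,i}{-si}
   = -\frac{k+1}{k} + \frac{1}{ks} \\
i(s) &= -\frac{k+1}{k}\,s + \frac{1}{k}\ln s + C \\
i(1) = 0 \;&\Rightarrow\; C = \frac{k+1}{k} \\
i(s) &= \frac{k+1}{k}\,(1-s) + \frac{1}{k}\ln s
\end{align*}
```

The first term is the linear function of s and the second term is the log function; k enters both of them.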


And given that we are in the middle of a pandemic now, I would invite all of you to take some coronavirus data and try to fit it with whatever you think is the damping factor. The damping factor can be derived from what they quote as the R0 number, which is the number of people that one person infects during his or her coronavirus affliction. So, for different damping factors, you can fit these curves and see how the infection will evolve over time.
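If you want to play with these curves, a small numerical sketch like the one below may help. It assumes that the two rate equations are $ds/dt = -si$ and $di/dt = si - \frac{1}{k}(1-s)i$, and it integrates them with a plain forward-Euler step. The function name `simulate_rumor`, the step size and the starting population of n = 1000 are illustrative choices and are not taken from the paper.

```python
def simulate_rumor(k, n=1000, dt=0.01, t_max=500.0):
    """Forward-Euler integration of the assumed rumor-mongering equations.

    ds/dt = -s*i
    di/dt =  s*i - (1/k)*(1-s)*i

    Starts with one infective node out of n and returns a list of
    (t, s, i) samples, stopping once the infective fraction is negligible.
    """
    s, i, t = 1.0 - 1.0 / n, 1.0 / n, 0.0
    trace = [(t, s, i)]
    while t < t_max and i > 1e-9:
        ds = -s * i
        di = s * i - (1.0 / k) * (1.0 - s) * i
        s, i, t = s + dt * ds, i + dt * di, t + dt
        trace.append((t, s, i))
    return trace


if __name__ == "__main__":
    for k in (1.0, 1.5, 2.0):
        t, s, i = simulate_rumor(k)[-1]
        print(f"k={k}: epidemic over at t={t:.1f}, susceptible fraction left={s:.3f}")
```

Plotting the i column of the trace against the s column gives the kind of curve discussed here, and a larger k should leave a smaller susceptible fraction at the end.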
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Implication 


There are still some susceptible nodes. They decrease exponentially with k.

So, if I were to plot i as a function of s: initially, of course, all the nodes are susceptible. So, we will start from here and walk backwards this way. If k = 1, what we get to see is a curve like this. So, basically, the multiplicative damping factor is actually $1/k$. So, $1/k = 1$.

So, what we are seeing is that the fraction of infective nodes will increase, increase, increase and then fall to 0; not abruptly, but let us say in this range it will fall to 0. So, ultimately, the final number of infective nodes is 0, which means no more infections can actually happen and no more nodes will get the update. We will see that around 20% of the nodes are effectively left out.


So, they are still susceptible, or they are left out. So, this clearly depends on the damping factor. If I were to make k = 1.5, which makes the damping factor $1/k = 2/3$, we see a shift of this curve towards the left. I should actually call $1/k$ the damping factor.

So, gradually, as that reduces, or as k increases, we will reach this point. The moment that the curve reaches this point, it basically tells us that no node will be left uninfected; in a sense, all the susceptible nodes will become infected. So, there will be no node that is still susceptible, which means that every node has been infected. And of course I show the curve here where k = 2, which means $1/k = 0.5$.

That is clearly too much, in the sense that we can see that all the nodes at the end will remain infective, and the number of susceptible nodes will actually fall down to 0. So, this is the key point over here: at the end, all the nodes still remain infective, no node is susceptible, and no node is removed from the network.
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Let us now see what percentage of nodes are still susceptible, 
when no other node is infective, i(s) = 0.

$s = e^{-(k+1)(1-s)}$

@ Exponentially decreases with k.
@ Some nodes still remain susceptible.
@ This value is called the residue.

So, now, let us see what percentage of nodes are still susceptible when no other node is infective, that is, i(s) = 0. This is essentially the point where we estimate the residue over here, which is this much. It can easily be derived to be $s = e^{-(k+1)(1-s)}$. So, we can see that it decreases exponentially with k; this value is called the residue, and some nodes will still remain susceptible.
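Since s appears on both sides of the residue equation, it has to be solved numerically. The sketch below does this with a simple fixed-point iteration on $s = e^{-(k+1)(1-s)}$; the function name `residue`, the starting guess and the tolerance are assumptions made for illustration only.

```python
import math


def residue(k, tol=1e-12, max_iter=10_000):
    """Solve s = exp(-(k + 1) * (1 - s)) for the non-trivial root in (0, 1)."""
    s = 0.5  # any start inside (0, 1) avoids the trivial solution s = 1
    for _ in range(max_iter):
        nxt = math.exp(-(k + 1.0) * (1.0 - s))
        if abs(nxt - s) < tol:
            return nxt
        s = nxt
    return s


if __name__ == "__main__":
    for k in (1, 2, 3, 4, 5):
        print(k, residue(k))
```

For k = 1 this gives a residue of roughly 0.20, which is in line with the figure of around 20% quoted for that curve, and the value keeps shrinking as k grows.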
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residue Sites that are still susceptible after the end of the 
epidemic. 
traffic Average number of messages sent per site. 
@ m updates per site, n sites, total nm updates
@ Chances that a site will miss all the updates: $(1 - 1/n)^{nm} \approx e^{-m}$

So, let us now look at a few more fundamental relationships. The residue is the set of sites that are still susceptible after the end of the epidemic, and the traffic is the average number of messages sent per site. So, let us say we send m updates per site, and there are a total of n sites. So, we will have a total of nm updates.

The chance that a site misses one particular update is $(1 - 1/n)$; we have seen this before. The chance that a site misses all the updates and still remains susceptible is this multiplied with itself nm times, which is $s = (1 - 1/n)^{nm}$, which is approximately $e^{-m}$. So, what we can see is that as the traffic increases, in the sense of the number of updates per site, the residue decreases, and ultimately the residue tends to 0.
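The approximation of the miss probability by $e^{-m}$ is easy to check numerically. The snippet below is a minimal sketch with an illustrative function name, `miss_probability`, and it assumes that every one of the nm updates goes to a uniformly random site.

```python
import math


def miss_probability(n, m):
    """Probability that one site is missed by all n*m updates,
    each of which is assumed to go to a uniformly random site."""
    return (1.0 - 1.0 / n) ** (n * m)


if __name__ == "__main__":
    n = 1000
    for m in (1, 2, 4, 8):
        print(m, miss_probability(n, m), math.exp(-m))
```

For a network of a thousand sites the two columns should agree closely, and both fall off exponentially in m.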


So, if you just increase the traffic, then the residue will reduce, reduce, reduce, and ultimately it will come down to 0. And how will it come down? Well, it will come down following this exponential relationship over here.

So, then, what is the key learning that we got from here? Rumor mongering can miss sites; it clearly can, because there is a residue. So, what we can do is run rumor mongering for some time. Rumor mongering is nice in that it is self-terminating.

But after a certain time, we can again run a slow background anti-entropy protocol to ensure that all the nodes are genuinely covered. And whenever the nodes discover a missing update (you can also add a threshold to this), they can start a hot rumor in their local region, such that we quickly cover that local region, and the rumor itself will fade out over time.

So, Xerox Clearinghouse did use many of these mechanisms. In addition, it did some amount of redistribution via direct mail also; this is something that we have not discussed. What we have discussed is primarily anti-entropy, where we said that at the beginning you follow a push-based technique and towards the end you follow a pull-based technique. Then we discussed rumor mongering.

So, we talked about the residue. I have also said that the residue is roughly $e^{-m}$. So, basically, increase the traffic per site and the residue will quickly fall down to 0. The details of how exactly the residue, the removed nodes, etc. are calculated are given in the paper.

But this should be enough to at least give you a feel for the broad mechanisms of these epidemic protocols. They are modeled along the lines of an epidemic. So, that is the reason they are called epidemic-inspired protocols.

So, now let us look at the final aspects: deleting nodes and spatial distribution.
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We can treat the deletion of an item as an update , and issue 
a death certificate. The death certificates can be propagated
through rumors or anti-entropy. When a death certificate meets
a later update, the update gets cancelled.

@ When do we discard death certificates?

@ Need to define a time threshold.

@ If a death certificate is older than the time it takes to propagate the update to all sites, we can delete it.

So, we can also treat the deletion of any item as an update, the same way that we call the modification of an item an update. We treat the deletion as an update and issue a death certificate for it. The death certificates themselves can be propagated via rumors or anti-entropy. And of course, when a death certificate meets a later update for that item, it basically means that whichever site updated the item did not get the death certificate. So then, of course, the update gets cancelled. But of course, at some point we need to discard death certificates.


Well, if you are using rumor mongering, they will ultimately fade out. Otherwise, we define a time threshold. If a death certificate is older than the time it takes to propagate the update to all sites, we can delete it. So, essentially there is some sort of maximum threshold, and if the current time minus the time of the death certificate is more than that, we just delete it.

At some sites, called retention sites, we can still maintain the death certificates. So that later on, let us say a long, long time later, if there is some update hiding somewhere, that update can be checked against this death certificate. The retention sites can do that. Something similar was also used in the Xerox protocol.

update, then activate the death certificate and propagate it.

So, now let us come to the idea of a dormant death certificate. The idea is that we keep a death certificate only at a few nodes. If it collides with an update, then of course we activate the death certificate and propagate it, which is what we discussed in the previous slide as well with regard to retention sites. So, what if a dormant death certificate meets an obsolete update? In this case, we reactivate the dormant death certificate and distribute it.
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It is possible that a legitimate update can be canceled if we do not set its time properly. So, of 
course, setting and managing the time is rather important. This can be solved by using version numbers for updates. So, what will happen is that a death certificate can have two timestamps: an original timestamp and an activation timestamp.

The original timestamp will be used to cancel updates. And the activation timestamp will be used to ultimately get rid of the death certificate, which means that if the activated death certificate has been propagating for a long time, we can discard it. So, what is the broad idea? Well, the broad idea is that in an unstructured network, messages can be floating around undetected.

It is a large network; some node will have some message, and later on it will wake up and start propagating the message. So, if you delete a node, it might still not lead to an actual deletion, because it is possible that a long time later some node will show up with a message.

If you delete not a node but some data, a later update can always be lurking around. That is the reason we had this notion of death certificates stored at retention sites. They will have an original timestamp, and whenever there is a collision of the dormant death certificate with an update, you just look at the timestamp.

If the timestamp is after the data died, well, you can cancel the update. But then there is a need to reactivate the dormant death certificate and recirculate it, because it appears that there are several sites that have not gotten the original death certificate. Now, the question is, what should be the time on it?

Do not change the original time; that remains. But also have an additional time called the activation time. So, why is this time required? It is required to see for how long this version of the death certificate has been in circulation. If it has been in circulation for a long time, then to reduce the traffic in the network we can just kill that death certificate and remove the messages.
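The two-timestamp idea can be sketched in a few lines. Everything below is hypothetical naming (`DeathCertificate`, `on_update`, `expire`, `max_active_age`). The sketch also assumes one particular convention for the comparison, namely that an update is cancelled when its timestamp is older than the certificate's original timestamp, which is one reading of the Demers et al. paper; if you read "later update" differently, only the comparison in `on_update` changes.

```python
from dataclasses import dataclass


@dataclass
class DeathCertificate:
    item: str
    original_ts: float     # when the item was deleted; never modified
    activation_ts: float   # when this copy was last (re)activated
    dormant: bool = False


def on_update(cert, update_ts, now):
    """Handle a collision between a certificate and an update for the same item.

    Returns True if the update is cancelled.
    """
    if update_ts >= cert.original_ts:
        return False           # assumed convention: not obsolete, so keep the update
    if cert.dormant:
        cert.dormant = False   # reactivate and recirculate the certificate
        cert.activation_ts = now
    return True


def expire(cert, now, max_active_age, is_retention_site):
    """Decide the fate of a certificate: 'keep', 'dormant' or 'discard'."""
    if cert.dormant or now - cert.activation_ts <= max_active_age:
        return "keep"
    if is_retention_site:
        cert.dormant = True    # retention sites hold on to a dormant copy
        return "dormant"
    return "discard"
```

Note that `on_update` only ever touches the activation timestamp. The original timestamp stays fixed, so that reactivating a certificate cannot widen the set of updates that it cancels.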
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Now, a few more issues so we will discuss some results of course without proof, the proof is there 
in the paper and in the references. Consider the fact that it takes time to send a message, depending upon the distance to the destination. Some known results, again presented without proof: if a node can contact only its neighbors, it takes order n time to spread an update using anti-entropy. That should not come as a surprise.

So, let us say this is a ring, and this is one site. If I can only contact my neighbors, this is how I go, and it will take order n time. However, if a node can contact any other node, it will take order of log n time to transmit the message using anti-entropy. And this again should not come as a surprise, because initially we just do a push-based system, and in a push-based system we reduce something by a constant factor in every round.
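The difference between the two cases can be seen in a toy simulation. The sketch below is an assumption-laden model, not the protocol from the paper: rounds are synchronous, every node contacts exactly one partner per round, and each contact is a push-pull exchange. The names `rounds_to_converge`, `ring_neighbor` and `any_other_node` are made up for this example.

```python
import random


def rounds_to_converge(n, pick_partner, rng, max_rounds=100_000):
    """Synchronous push-pull anti-entropy starting from node 0.

    Returns the number of rounds until all n nodes have the update.
    """
    has_update = [False] * n
    has_update[0] = True
    rounds = 0
    while not all(has_update) and rounds < max_rounds:
        snapshot = list(has_update)  # exchanges use the start-of-round state
        for node in range(n):
            partner = pick_partner(node, n, rng)
            if snapshot[node] or snapshot[partner]:
                has_update[node] = has_update[partner] = True
        rounds += 1
    return rounds


def ring_neighbor(node, n, rng):
    """Contact only the left or the right neighbor on a ring."""
    return (node + rng.choice((-1, 1))) % n


def any_other_node(node, n, rng):
    """Contact a uniformly random node other than yourself."""
    other = rng.randrange(n - 1)
    return other if other < node else other + 1


if __name__ == "__main__":
    rng = random.Random(1)
    for n in (64, 256, 1024):
        print(n,
              rounds_to_converge(n, ring_neighbor, rng),
              rounds_to_converge(n, any_other_node, rng))
```

On the ring the round count should grow roughly in proportion to n, whereas with uniformly random partners it should grow far more slowly, in line with the order of log n behavior stated above.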


And let us say this factor is e, and we have k such rounds. So, in every round we reduce the susceptibility by a factor of $1/e$. We can also approximate that by saying that we increase the number of infected sites by a factor of e, which again is not totally correct, but serves as an approximation. Then we can say that for $e^k = n$, the natural log of n, $\ln n = k$, is the number of rounds. So, that is roughly how the term order of log n comes. Of course, we do push at the beginning and pull at the end, but that does not matter; it still takes about log n steps. You can keep that in mind, without going into the details of a rigorous proof, as the time it takes anti-entropy to reach convergence, or roughly all the nodes.

We can then extend this with some specific results. So, let distance be defined as the number of hops in the network, and let the probability of connecting to a site at distance d be $d^{-a}$. So, this is basically d raised to the power of some negative quantity. First, take a > 2.

So, this is kind of stronger than the inverse square law. So, it will take a long time to converge, which is polynomial in n. And this can be seen as a generalization of this result. On the other hand, if a < 2, which means it is kind of weaker than an inverse square law, then it will take polylogarithmic time for convergence, again a generalization of this result. So, that is the important point; I am just moving to another slide to give myself some space.
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Paper 


A Gossip Style Failure Detection Service by Robbert van Renesse,
Yaron Minsky, and Mark Hayden (Technical Report)

@ Problem: A set of nodes fail. Design a failure detector that
detects failures by gossiping.

@ Model of failure: Fail-Stop = If a node does not respond
to a message for T seconds, then it has most likely failed.

@ The algorithm needs to scale in terms of the number of
nodes, n.
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So, the important point over here. Just a continuation of the previous slide. The important point is 
that in any epidemic-based algorithm, if you want to achieve polylog-time convergence, which means the message reaches everybody, particularly using anti-entropy, then you need to contact random nodes with a reasonable probability. So, this basically means that just contacting neighbors is not good enough. There should be some facility for contacting even faraway nodes with a reasonably decent probability, so that the epidemic can spread faster.

So, coronavirus does not work that way; coronavirus is still a neighbor-to-neighbor transmission, depending upon who the neighbors are. That is the reason it has not engulfed the entire world yet.

But in an epidemic kind of scenario, if we really want to reach everybody, then we need a < 2, and if we want to restrict the spread, a > 2. This is when the convergence time goes from polylog to polynomial.

So, we did take some space out of this slide to discuss the results of epidemic-based distribution. So, now, what we will do is first apologize for using the real estate of the gossip algorithms slide, and then discuss gossip-based algorithms. Gossip-based algorithms are of a different kind; they are of a different nature, and they are not based on epidemics.

So, the key reference that we will use is this paper, A Gossip Style Failure Detection Service, which is one of the seminal references in this area. The problem is this: if a set of nodes fail, design a failure detector that detects failures by gossiping. Gossiping basically means that you just pass on information to whoever you know.

So, the model of failure is fail-stop, which means that if a node does not respond to a message for T seconds, we say that it has most likely failed. And the important thing is that, similar to rumor mongering and anti-entropy, the algorithm needs to scale with n. So, we will look at such gossip-based mechanisms.
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Aims of a Protocol 


@ Probability of a false positive is independent of n.
@ Resilient to message loss and network partitions.
@ Scalability in detection time: O(nlog(n))
@ If clock drift across nodes is negligible, then the algorithm
detects all failures with a known probability of a mistake.

Next, we will discuss the gossip protocol. So, the aims of the protocol are as follows. The first concerns the probability of a false positive, which means that we say that a given node has failed, but actually it has not failed. This probability is independent of n. So, this is a rather strong aim, in the sense that regardless of the scale of the network, our probability of a false positive is bounded and does not depend on n. Furthermore, our protocol is resilient to message loss and network partitions. So, regardless of the fact that networks partition and messages are lost, the protocol can still function.

And scalability in detection time is achieved, in the sense that the time it takes to detect all failures is order of n log n. And if the clock drift across the nodes is negligible, then the algorithm detects all failures with a known mistake probability. And the increase in bandwidth, which is pretty much the number of messages, is linear in terms of the number of processes. So, the key operative point here is that the probability of a false positive is bounded, the detection time is O(nlog(n)), roughly linear, and the bandwidth is linear.

@ Each node maintains a message list. (member_id, timestamp, heartbeat counter)

@ Every $T_{gossip}$ seconds, each node updates its heartbeat counter, and sends a gossip message to a randomly chosen node.

@ The gossip message contains the member ids, and their heartbeats.

@ The receiver merges the message lists (takes the larger heartbeat), and adopts the larger heartbeat counter for a node.

@ The timestamp for a member indicates the last time that the receiver thinks that a node has updated its heartbeat counter.

So, what we do is that each node maintains a message list, right, a list of messages. And each of the items in the message list contains three fields: member id, timestamp and heartbeat counter. The member id is clearly some information about another node. So, the id of that node is being called the member id.

So, what we do is that every $T_{gossip}$ seconds each node will increase its heartbeat counter and send a gossip message to a randomly chosen node; something similar to anti-entropy is done here.

So, the heartbeat is a monotonically increasing counter, and an increased heartbeat tells the rest of the nodes that the node is alive. The gossip message contains the member ids and their heartbeats. So, basically the gossip message can include the whole message list, which is the member id, timestamp and heartbeat for every node that the sender has information about.

The receiver will merge the message lists, which basically means that for the same member id it will take the larger heartbeat and adopt it as the heartbeat counter for that node. Finally, the timestamp for a member indicates the last time that the receiver thinks that the node updated its heartbeat counter. So, let us say this message list belongs to node number 10, and in it the heartbeat counter for node number 20 is currently 11.

So, the timestamp for this entry essentially indicates the last time that the receiver thinks that node 20 updated its heartbeat counter to 11. In a world where clocks are perfectly synchronized, the timestamp can be a global timestamp, in the sense that whenever a node increases its heartbeat, it can just attach its timestamp to all messages and send them. Then all nodes, since they uniformly agree on the time, will be sure that at a certain timestamp node number 20 incremented its heartbeat.

But in a world where clocks themselves are not synchronized, there is no clock synchrony, and the timestamp itself is an approximate quantity. So, the timestamp for a given member will indicate at best an approximate estimate of when the heartbeat was updated. So, the way that we will set the timestamp is that whenever we receive a message about a member, you know, maybe directly from the member, we set the timestamp to the local time of receiving that message, since the clocks are not synchronized.
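The bookkeeping described above can be sketched as follows. This is a minimal illustration and not the paper's implementation: the class names `Entry` and `Member` and the methods `tick` and `receive` are invented here, the random choice of the gossip target is left out, and the sketch assumes that the timestamp is refreshed only when a larger heartbeat arrives.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    heartbeat: int
    timestamp: float  # local time at which this heartbeat value was first seen


class Member:
    def __init__(self, member_id, now):
        self.member_id = member_id
        self.table = {member_id: Entry(heartbeat=0, timestamp=now)}

    def tick(self, now):
        """Called every T_gossip seconds: bump own heartbeat, return the payload."""
        own = self.table[self.member_id]
        own.heartbeat += 1
        own.timestamp = now
        return {mid: entry.heartbeat for mid, entry in self.table.items()}

    def receive(self, payload, now):
        """Merge a gossip payload, keeping the larger heartbeat per member id."""
        for mid, heartbeat in payload.items():
            known = self.table.get(mid)
            if known is None or heartbeat > known.heartbeat:
                # Stamp with the receiver's own clock; clocks are not synchronized.
                self.table[mid] = Entry(heartbeat, now)
```

The payload carries only member ids and heartbeats. Each receiver attaches its own local timestamp, which is why no clock synchronization is needed.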


(Refer Slide Time: 1:04:23) 


Failure Detection

@ If the heartbeat counter for a node hasn't increased in $T_{fail}$ seconds, then a node presumes that it has failed.

@ Let the probability of a false positive be bounded by $P_{fail}$.

@ However, the entry is not removed because a node can continue to get gossips about the failed node, possibly from other nodes.

@ It thus waits for more time and removes the node after $T_{cleanup}$ seconds.

@ Let the probability of the node still being alive be bounded by $P_{cleanup}$.

@ $P_{fail} = P_{cleanup}$ if $T_{cleanup} = 2 \times T_{fail}$


Smruti R. Sarangi Communication between Nodes


So, how do we detect a failure? Well, detecting a failure again is an approximate activity. Say the heartbeat counter for a node has not increased in $T_{fail}$ seconds; then the node is presumed to have failed. But it could be that, for whatever reason, either messages got delayed, or there was a partition in the network, or the receiver had crashed and then recovered.

So, there is a probability of a false positive, and in this case it is bounded by $P_{fail}$. What is a false positive? We think a node has failed, but actually it has not failed, so we incorrectly conclude that the node has failed. However, the entry in the message list is not removed, because a node can continue to get gossips about the failed node from other nodes.

So, even after it concludes that the node has failed, it will not remove the entry; it will rather keep it. It waits for some more time and finally removes the node after $T_{cleanup}$ seconds. So, let the probability that the node is still alive after $T_{cleanup}$ seconds be bounded by $P_{cleanup}$.

This is another variant of the false positive. So, consider the case where both of these probabilities are equal, $P_{fail} = P_{cleanup}$. After $T_{fail}$ seconds, we conclude that the node has failed. But again, there is a mistake probability, in the sense that there is a probability of a false positive of $P_{fail}$. After that, we wait for some more time. So, this entire duration we can call $T_{cleanup}$.

So, the question is: given that we have come here, if we again wait for this much time, and in spite of that the node is still not responsive, can we conclude that it has failed? Maybe yes, we can, and we can remove it from the list. But there is a probability of a false positive in this interval, which is again conditioned on the fact that we waited for $T_{fail}$ seconds and did not hear anything about the node.

If we want this probability $P_{cleanup} = P_{fail}$, then this interval should also be equal to $T_{fail}$. Hence, $T_{cleanup} = 2 \times T_{fail}$. That is when both of these probabilities will be equal. But as I said, there is no specific need to make them equal other than for some degree of mathematical elegance. You can refer to the paper, but in general, having mathematical elegance while analyzing such complex systems is desirable. And to get it, all that we need to do is set $T_{cleanup} = 2 \times T_{fail}$.
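The two thresholds translate into a three-way decision for every entry in the message list. The function below is an illustrative sketch: the name `classify` and the string labels are made up, and it hard-codes the choice $T_{cleanup} = 2 \times T_{fail}$ discussed above.

```python
def classify(last_increase, now, t_fail):
    """Classify one entry in the message list.

    last_increase is the local time at which the member's heartbeat last grew.
    Returns 'alive', 'failed' (entry is kept for now) or 'remove'.
    """
    t_cleanup = 2 * t_fail  # the choice that makes P_cleanup equal to P_fail
    age = now - last_increase
    if age < t_fail:
        return "alive"
    if age < t_cleanup:
        return "failed"
    return "remove"
```

A member in the 'failed' state stays in the list, so a late gossip that carries a larger heartbeat for it can still be merged and bring it back to 'alive'.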
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Mathematical Analysis 


Analysis 
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Let us do some more math. So, let us assume that f out of n members are failed. So, we need to 
detect all f. Next, k out of n members are infective. So, infective means that they are spreading the message about failed nodes. And the third point is that in a given round, only one node sends one message to another node. So, what is the probability of incrementing the number of infective nodes? Let us say that the current number of infective nodes is k; the probability of incrementing it, $P_{inc}(k)$, is given by this expression:

$P_{inc}(k) = \frac{k}{n} \times \frac{n - f - k}{n - 1}$

So, let us work it out. The first factor is the probability that one of the infective nodes actually sends the message, and this is $k/n$ because we are assuming a uniform probability of sending a message. Next, what is the probability that a node that is neither failed nor infected gets the message?

So, the number of such desirable nodes, for the sake of this probability calculation, is n minus the number of failed nodes minus the number of infective nodes, that is, n - f - k. And of course, there are n - 1 choices, so the total sample space is n - 1, and thus the probability is $(n - f - k)/(n - 1)$. Since these are independent events, the probabilities get multiplied.

So, let the probability of having k infected members in round i + 1 be $P(k_{i+1} = k)$. For it we can infer the following recursive relationship:

$P(k_{i+1} = k) = P_{inc}(k-1) \times P(k_i = k-1) + (1 - P_{inc}(k)) \times P(k_i = k)$

Here I would like to make a point that whenever a system is complicated and difficult to handle, and we cannot come up with a closed-form expression, we often try to set up some sort of a recurrence relation. A recurrence is related to recursion, where essentially the same expression repeats over and over again.

So, in this kind of a scenario, where we have set up a recurrence, what we can see is this: in the (i + 1)th round we have k infected members, in the sense of k members who have some information about the failed nodes. Either in the ith round we had k - 1 infective members and then there was an increment, which gives $P_{inc}(k-1) \times P(k_i = k-1)$.

Or in the ith round we already had k infective members and there was no increment, which gives $(1 - P_{inc}(k)) \times P(k_i = k)$. So, both of them can be calculated, and we can use this recurrence relation to nicely calculate $P(k_i = k)$.
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So, the probability that there is some process that does not get infected by let us say process P after 
r rounds, let it be $P_{mistake}(p, r)$. We will use the probability P from the recurrence over here: $P_{mistake}(p, r) = 1 - P(k_r = n - f)$. So, let us understand this slowly.

Let us first break this expression down into its components and understand each of them one after the other. The first term is the probability that at the end of r rounds, (n - f) nodes are infective. Where does (n - f) come from? n is the total number of nodes, and f is the number of failed nodes.

So, (n - f) refers to all the nodes that have not failed. The probability that all of those nodes are infective at the end of the rth round is $P(k_r = n - f)$. So, the probability that some process, which means at least one process, is still not infected is $P_{mistake}(p, r) = 1 - P(k_r = n - f)$.

This was for a given process p. If I want to look at all the processes, then I can use a simple result of probability theory, the union bound: $P_{mistake}(r)$ is the probability of the union of these events over all non-failed processes, so it is upper bounded by the sum of the individual probabilities. Each $P_{mistake}(p, r)$ is given by the expression above, so $P_{mistake}(r) \le (n - f) \times P_{mistake}(p, r)$, where (n - f) is the total number of non-failed nodes. Which means that we have an upper bound for $P_{mistake}(r)$. So, let us keep moving.
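The recurrence is easy to evaluate with a small dynamic program. The sketch below assumes the forms $P_{inc}(k) = \frac{k}{n} \times \frac{n-f-k}{n-1}$ and $P(k_{i+1} = k) = P_{inc}(k-1) P(k_i = k-1) + (1 - P_{inc}(k)) P(k_i = k)$, and it assumes that exactly one member is infected before the first round. The function names are illustrative.

```python
def p_inc(k, n, f):
    """Probability that one round raises the infected count from k to k + 1."""
    return (k / n) * ((n - f - k) / (n - 1))


def infected_distribution(n, f, rounds):
    """Return probs with probs[k] = P(k_r = k) after the given number of rounds."""
    alive = n - f
    probs = [0.0] * (alive + 1)
    probs[1] = 1.0  # assumption: a single infected member at the start
    for _ in range(rounds):
        nxt = [0.0] * (alive + 1)
        for k in range(1, alive + 1):
            stayed = (1.0 - p_inc(k, n, f)) * probs[k]
            grew = p_inc(k - 1, n, f) * probs[k - 1]
            nxt[k] = stayed + grew
        probs = nxt
    return probs


def p_mistake_member(n, f, rounds):
    """P_mistake(p, r) = 1 - P(k_r = n - f)."""
    return 1.0 - infected_distribution(n, f, rounds)[n - f]


def p_mistake_bound(n, f, rounds):
    """Union bound: P_mistake(r) <= (n - f) * P_mistake(p, r), capped at 1."""
    return min(1.0, (n - f) * p_mistake_member(n, f, rounds))
```

Remember that a round here means a single message from a single member, so the number of rounds needed to push the bound down grows with n.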
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So, let us discuss the performance results. So, the experiment that was performed in this paper is 
that the number of members is increased from 1 to 200. So, what we see is that the failure detection time varies linearly in the log scale. For a mistake probability of $10^{-9}$ it increases from roughly 0 to 250 seconds, for $10^{-6}$ from 0 to 10, and for $10^{-3}$ from 0 to 150. If one member has failed, the bandwidth requirement is 250 bytes per second and the time is O(nlog(n)).
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So, if let us say we vary the mistake, the mistake probability right that we derived two slides ago 
if we vary it over several orders of magnitude on a log scale, what we see is that for 150 members the detection time reduces linearly in the log scale from 200 seconds to 95 seconds; with 100 members the detection time again reduces linearly, from 130 to 160; and with 50 members it again reduces linearly, from 60 to 25 seconds. So, this is essentially the detection time versus the mistake probability, where the mistake probability is something that we are willing to tolerate.


(Refer Slide Time: 1:15:25) 


Catastrophe Recovery

@ Gossip algorithms do not work in the case of network partitions.

@ The failure detector needs to broadcast messages to re-establish connections.


Now, let us discuss catastrophe recovery. Gossip algorithms, where nodes essentially gossip with each other, do not work in the case of network partitions. So a dedicated failure detector needs to broadcast messages and re-establish connections. Unless connections are re-established, messages will not reach the part of the network that has been partitioned.


So, we can have a broadcast protocol where, each second, a node probabilistically decides whether to send a broadcast. This probability depends on the last time the node received a broadcast. Let us say a node last received a broadcast 20 seconds ago; then it broadcasts with a very high probability.


One function of this kind, p(t) with a parameter a, fits well. Basically, the idea is that if a node has been receiving broadcasts from other nodes, it can be reasonably sure that the network is connected. But if a long time has elapsed and it has not received any broadcast, then it tries to re-establish connections.
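

To make the idea of a time-dependent broadcast probability concrete, here is a minimal sketch. The exact function is not recoverable from the transcript, so the form (t / t_max) ** a, the cutoff t_max = 20 seconds and the value a = 10 are assumptions chosen only so that the probability stays tiny while broadcasts are arriving and approaches 1 near 20 seconds of silence.

```python
import random


def broadcast_probability(t, t_max=20.0, a=10.0):
    """Hypothetical p(t): chance of broadcasting in this second, given that the
    last broadcast was received t seconds ago. The functional form and both
    parameters are assumptions for illustration, not taken from the paper."""
    if t <= 0:
        return 0.0
    return min(1.0, (t / t_max) ** a)


def should_broadcast(t, rng=random):
    """Each node calls this once per second to decide whether to broadcast."""
    return rng.random() < broadcast_probability(t)
```

With these assumed values a node that heard a broadcast 10 seconds ago broadcasts with probability below 0.1 percent, while a node that has heard nothing for 20 seconds always broadcasts.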


- Robbert van Renesse, Yaron Minsky, and Mark Hayden. A Gossip-Style Failure Detection Service. Technical Report.
- Alan Demers, Dan Greene, Carl Hauser, Wes Irish, John Larson, Scott Shenker, Howard Sturgis, Dan Swinehart, and Doug Terry. Epidemic Algorithms for Replicated Database Maintenance. PODC 1987.


So, broadly speaking, these were the two papers that we discussed in this lecture. The first was Epidemic Algorithms (the papers are listed here in the reverse order), and the second was A Gossip-Style Failure Detection Service.


Welcome to the lecture on first and second generation peer-to-peer networks. In this lecture, we will discuss two very important networks that, unfortunately, have acquired a bit of a bad name, primarily because they were used to share unlicensed and, in some cases, illegal content. But they have, to a large measure, determined the direction of work in this area. Furthermore, most advanced distributed systems today owe their rise to such peer-to-peer technology. We will discuss Napster and Gnutella.


Napster was considered the first generation peer-to-peer network. So what is a peer-to-peer network? Here we have a set of machines. In general, no machine is more privileged than any other, and they cooperate with each other to provide a large number of files (in the case of Napster, music files) to a worldwide community of users. Gnutella, a second generation network, is an improvement over Napster. We will discuss how.


So, as an overview, we will first discuss the protocol and the security and privacy issues with Napster. Then we will provide an overview of Gnutella and discuss its details. Finally, we will do a little bit of comparison of Gnutella and Napster.


History of Napster


- The mp3 format was the first widely used format for music files. A 5 min song required 5 MB of storage, which was pretty reasonable.
- Sites like mp3.com were the earliest mp3 sharing sites. However, most of the time links were broken.
- In 1999 Shawn Fanning observed:
  - Need a dedicated search engine to find mp3 files only.
  - The ability to trade mp3 files with other users.
  - Find and chat with other mp3 users online.


The provider needed:
- Napster installed on the computer.
- A shared directory.


So what is the history? This history is roughly towards the end of the 90s. The mp3 format was developed; mp3 is an audio format. What people used to do in those days was record these wonderful five-minute songs as mp3 files. An mp3 took roughly an MB a minute, so five minutes was five megabytes, which was just about what the internet of those days could handle.


The main advantage of mp3, by the way, was that compared to the previous format, which was wav files, the mp3 format was much more compressed without any great loss in quality.


That started a boom in mp3 sharing. A large part of that sharing was an unintended consequence of technology like Napster, and a large part of it was illegal. Shawn Fanning, who invented Napster, created a dedicated search engine for mp3 files only, with the ability to find and trade mp3 files, chat with other users, and form an entire community meant to exchange and distribute mp3 files, audio files, music files, to whoever required them.


So the provider needed Napster installed on the computer, along with a shared directory. There were a number of directory servers over the net, which we will discuss. But primarily all that was required was a Napster client on a computer; nothing else was required.


- User opens the Napster utility.
- She logs on to the server.
- The server updates its database with the list of files in the client's shared directory.
- The client types a search term for a query.
- The server responds with the IP address of the machine containing the song.


Now the basic protocol. This is a first generation peer-to-peer network from around 1999. All that the user had to do was open the Napster utility and log on to the server. The server would update its database with the list of files in the client's shared directory. So every client had to share a directory with the mp3 files that the client owned.


In a certain sense, these files were made available to the community. Then, let us say the client wanted a certain file. The client would write a query string and search for that file. The server would respond with the IP address of the machine that contained the song. The client would establish a connection with the machine at that IP address and get the song. So all that the server did in this case was act as a broker. Once a client logged in, the server would get the list of songs that the client had. Furthermore, if there was a search query, the server would essentially point the client to one more client, and then a direct connection could be established between them and the file could be transferred.
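

The broker's role described above can be sketched as a small in-memory index. This is only an illustration of the idea, not Napster's actual implementation: the class name, the method names and the substring matching rule are all assumptions.

```python
from collections import defaultdict


class Broker:
    """Toy central index (hypothetical): file name -> addresses sharing it."""

    def __init__(self):
        self.index = defaultdict(set)

    def login(self, client_ip, shared_files):
        # On login the client reports the contents of its shared directory.
        for name in shared_files:
            self.index[name.lower()].add(client_ip)

    def logout(self, client_ip):
        # Forget everything this client was sharing.
        for holders in self.index.values():
            holders.discard(client_ip)

    def search(self, term):
        # Return (file name, IP) pairs whose name contains the search term.
        # The broker never touches the file itself; it only brokers addresses.
        term = term.lower()
        return sorted(
            (name, ip)
            for name, holders in self.index.items()
            if term in name
            for ip in holders
        )
```

The client would then open a direct connection to one of the returned addresses; the transfer itself never passes through the broker.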


So what was the architecture? Given that it is a first generation peer-to-peer network, this was a centralized architecture, with one central Napster server in the middle and a lot of client machines around it.


Napster Protocol


- It is a client-server architecture.
- Server (broker) runs on port 7777, 8888 or 8875.
- A message to/from the server is of the form:

  <length><type><data>

  length  Describes the length of the message (2 bytes).
  type    error / login / login ack / version / upgrade
  data    The actual data.


So it was a quintessential example of a client-server architecture, where we have one centralized server. The server used to run on three ports: 7777, 8888 and 8875. So when universities later started banning Napster, primarily because songs without legitimate licenses were being shared, what they basically did was close down these ports.


As a result, most Napster clients had a major issue. The message format between the client and the server was very simple: length, type and data. The length describes the length of the message; it was limited to two bytes, that is, sixteen bits. The type could be an error, a login message to log in with the password, or a login ack, which is an acknowledgement message. It could also carry the version of Napster being used, and whether the Napster client needed to be upgraded or not. All of this was included in the type of the message. And the data was the actual data that was being transmitted.


So this is, in a nutshell, what a Napster message contained: just length, type and data, nothing more. If you look at Napster right now, in 2021, when I am recording this video, it would appear to be a very simple protocol.
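

A length, type and data message like this is easy to frame and parse. The sketch below is illustrative only. The lecture fixes the length field at 2 bytes; the 2-byte type field, the little-endian byte order, the choice that the length counts only the data bytes, and the numeric type codes are all assumptions made for the example.

```python
import struct

# Hypothetical numeric codes; the lecture only names the message types.
MSG_TYPES = {"error": 0, "login": 2, "login_ack": 3}


def encode(msg_type, data: bytes) -> bytes:
    """Build <length><type><data> with assumed 2-byte little-endian fields."""
    if len(data) > 0xFFFF:
        raise ValueError("data does not fit in a 2-byte length field")
    return struct.pack("<HH", len(data), MSG_TYPES[msg_type]) + data


def decode(buf: bytes):
    """Parse one message and return (type code, data)."""
    length, type_code = struct.unpack_from("<HH", buf, 0)
    data = buf[4:4 + length]
    if len(data) != length:
        raise ValueError("truncated message")
    return type_code, data
```

A 2-byte length field caps a single message at 65535 bytes of data, which is plenty for logins and search strings but also shows why file contents travel over a separate peer-to-peer connection.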


Protocol Overview


- There are mainly three kinds of nodes: clients, lookup servers, and brokers (servers).
- The client first finds the address of a broker by contacting a lookup server.
- The lookup server finds the least loaded broker.
- The client exchanges messages with the server using TCP/IP.
- The server can give it the address of a peer which contains a copy of the data.
- A Napster peer has five concurrent entities running:
  - Main coordination: connection and communication with brokers
  - Listener: handles incoming connections from peers
  - Upload, Download, and Push instances: transferring files between peers


So there are mainly three kinds of nodes: clients, lookup servers and brokers; brokers are also called servers. The client first finds the address of a broker by contacting a lookup server. A broker is basically the server that does most of the job, whereas the lookup server can be thought of as a special kind of server that just gives the client the address of the least loaded broker, in terms of the amount of work it is doing. The client will then establish a regular TCP/IP connection with that broker.
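

The lookup server's job can be sketched in a few lines. How Napster actually measured load is not stated in the lecture, so the `loads` dictionary and its numeric load metric are assumptions.

```python
def pick_broker(loads):
    """Return the address of the least loaded broker.

    `loads` is a hypothetical dict mapping broker address -> current load,
    for example the number of connected clients."""
    if not loads:
        raise LookupError("no brokers registered")
    return min(loads, key=loads.get)
```

The lookup server does no searching and stores no file lists; it only redirects the client, which keeps it cheap enough to sit in front of many brokers.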


At this point, if the viewer of this video is not sure about some networking concepts, such as TCP/IP, proxy servers, and so on, I would request them to kindly look these up. So the client will establish a regular TCP/IP channel with the server. The server can give it the address of a peer. Peer is a special term in this case: the peer is essentially one more client, which contains a copy of the data, which in this case is a song.


A Napster peer will have five concurrent entities; an entity is essentially a software thread. It will have a coordination entity, whose job is to connect and communicate with brokers, and a listener, which will handle incoming connections from peers. So if there is a peer that is trying to connect to it, the listener will handle the incoming connection. Then there are the upload, download and push instances. We will have ample opportunity to take a look at what these software components are. They essentially help in transferring files between peers.


So if I were to look at the flowchart of Napster, this is what it would look like. Let me start with the offline state. From the offline state, when the user wants to log in, the client will connect to a Napster lookup server, which will find the best broker. Then its connection request will be sent to the best broker. The login information will be exchanged, and if the password matches, the client is online.


Subsequently, it will let the broker know that it wishes to share a certain number of files that are in a shared directory. The metadata will be uploaded, and once all the notifications are done, you can say that the client is ready to share. After that, the client can begin searching. Let us say you want to play a certain song. You write the keywords of the song, and the process of searching will be initiated. A response will be sent back to the client, so that the client can then establish a direct connection with the remote peer and get a copy of the song.
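

The flowchart just described can be written down as a small state machine. This is a rough sketch: the state names follow the narration, while the event names and the decision to return to the ready state after a search are assumptions made for illustration.

```python
from enum import Enum, auto


class State(Enum):
    OFFLINE = auto()
    ONLINE = auto()
    READY_TO_SHARE = auto()
    SEARCHING = auto()


# (state, event) -> next state. Event names are hypothetical labels.
TRANSITIONS = {
    (State.OFFLINE, "login_ok"): State.ONLINE,
    (State.ONLINE, "metadata_uploaded"): State.READY_TO_SHARE,
    (State.READY_TO_SHARE, "search"): State.SEARCHING,
    (State.SEARCHING, "search_response"): State.READY_TO_SHARE,
}


def step(state, event):
    """Advance the client by one event, rejecting events that do not apply."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in state {state.name}")
```

Writing the transitions as a table makes it obvious that a client cannot search before it has logged in and announced its shared files.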


Napster Protocol


Download Instance


- Starts from the download state.
- It then sends a download request message to the broker, and waits for a reply.
- There are many reasons for being denied: illegal request, the remote host cannot upload a file, and the file is not there.
- If the file can be downloaded, the broker sends the download ack message to the client.
- The contents of the download ack message:
  - Location of the file (remote hostname, or IP address)
  - TCP port
- If the port is 0, then we enter the remote client upload state.
- If it is non-zero, we enter the remote client download state.


So how does the download instance work? It starts from the download state. It then sends a download request message to the broker and waits for a reply. There are many reasons for the request being denied: it can be an illegal request, the remote host may not be able to upload a file, or the file may not be there. If the file can be downloaded, the broker sends the download ack message to the client, which tells the client that yes, there is a remote peer that has a copy of the file, and the file can be downloaded from that remote peer.


The broker sends the download ack message back to the original client. Let us fix the terms like this: we are the client, then we have the broker whose job is to search, and then we have another client, which we will call the remote peer.


Once the download ack message is received by the original client, it knows that it needs to establish a connection with the remote peer. The information sent back to it is the location of the file, which includes the remote host name or IP address, and the TCP port. Let us first consider the case where the port is non-zero. Then we enter the remote client download state and there is no problem: a message is sent, and we can download the file from the remote client. However, there is a problem. Consider a university setting where the remote client is behind a proxy server, as is the case in most universities, which have their private networks behind a firewall, a proxy and so on. In that case, it is not really possible for any machine outside the organization to initiate a direct connection to a machine inside the organization. So a direct download will not work. For this case Napster did provide a wonderful mechanism, because it was primarily meant for college students to share files.


So the broker returned the special value of zero for the port, and we entered the remote client upload state. This was a nice trick; you can think of it as one of the dirty tricks that gave Napster fame as well as infamy. Napster, in that sense, is extremely controversial, because it allowed college students behind these closed networks to actually establish connections and exchange data. A key part of this was that if the remote peer is within a university that does not allow a direct connection initiated from outside, then a port of zero is returned and we enter a different state called the remote client upload state. We will see what it is in the next slide.


The remote client upload state is slightly complicated. As I said, the remote peer is behind a firewall, so it cannot export a link for download. So what happens? We have the client here, the remote peer here, and the broker here. The client will seek the broker's help by sending an alternate download request. The broker will then ask the remote peer to initiate a connection with the client, rather than the client initiating a connection with the remote peer. So the client will wait. Since the broker knows that the remote peer is behind a firewall, it will tell the remote peer: look, this is the address of the client. What you do is send a message to the client, and once there is a connection between both of you, you can transfer the file to the client.


If somebody does not know what NAT (network address translation) is, and what it means to be in a closed network like a university network, I would suggest that you kindly look up this piece of networking. This is easily understood by somebody who knows the concept; otherwise, the elegance of this mechanism will not be clear.


So what you need to look up is called NAT, network address translation. The key idea is that if the remote peer is within a secure internal network, it is not possible for the client to contact it. However, if the client is publicly visible, then the remote peer can set up a connection with the client, and it can then send the metadata and the contents of the song.


So the client will get the contents of the song anyway, but it will happen via this mechanism. Of course, the simpler case is when the remote peer is directly visible; then the client directly sends a request to the remote peer and gets the song back. But in many university settings the remote peer is not accessible as such. The broker will then tell the remote peer: look, you have a private IP address, one that starts with 10, which means you are not externally visible.


So you need to know what a 10.x IP address means. Since you are not externally visible, you initiate a connection with the client and get the transfer done. I hope that this tricky notion of remote client upload and remote client download is clear. If not, look up NAT. The concept basically is that even when one of the users was behind a firewall in a private network, there was still a way to do the transfer.
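

The choice between the two states comes down to one field of the download ack. Here is a minimal sketch; the `DownloadAck` class and the returned state labels are hypothetical names, and the private-address check uses Python's standard `ipaddress` module only to illustrate why a 10.x peer is unreachable from outside.

```python
import ipaddress
from dataclasses import dataclass


@dataclass
class DownloadAck:
    host: str  # remote hostname or IP address
    port: int  # 0 signals that the remote peer cannot accept connections


def plan_transfer(ack: DownloadAck) -> str:
    if ack.port == 0:
        # Remote client upload: the client waits and the remote peer,
        # prompted by the broker, initiates the connection.
        return "remote_client_upload"
    # Remote client download: the client connects to host:port itself.
    return "remote_client_download"


def is_private(addr: str) -> bool:
    """True for addresses such as 10.x.x.x that are not externally visible."""
    return ipaddress.ip_address(addr).is_private
```

Note that the reversal only works when the client itself is publicly reachable; the lecture does not cover the case where both sides are behind a firewall.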


Now let us come to the upload instance. When I am uploading the file, which means I am transmitting the file, I receive the file name, I set up the transfer, and I send all the bytes of the file. Then the upload is over and the connection is closed. So there is no great rocket science in this.


Security and privacy. This is why Napster ultimately got a bad name. The first point is that it is hard for clients to lie. They cannot give fake details such as a fake IP address; otherwise, who do you transfer the file to? The central site, which in this case is the broker, has all the control. It is aware of every single transfer. So later on, if there is any legal issue, you can pretty much go after the central site, and then that is trouble.


So, the problem was that millions of people were freely sharing these copyrighted songs. Furthermore, universities were providing free internet facilities, and these facilities, frankly speaking, were being abused. So by the early 2000s, most universities had actually banned the use of Napster, and pretty heavy penalties were imposed on students and other people who were using Napster to share files for which they did not have legal licenses. This also caused a huge loss of revenue for the music industry, so the music industry did go after many people pretty hard. But peer-to-peer file sharing did not stop.


So the problem with Napster was the central server, which is aware of everything. If you get control of the central server, you find every single transfer, and then people are suddenly legally liable. Gnutella got rid of the idea of the central server. Also, later protocols such as Gnutella did not limit the sharing just to music files.


They brought in all kinds of files. By the time Gnutella came, which was roughly after 2000, all kinds of files could be shared with reduced legal liability. But this still did not make it legal to transfer a file for which the user did not have a valid license. Nevertheless, it was an improvement, at least in the computer science sense.


Now let us come to Gnutella. Gnutella is another distributed network; we are calling it a second generation peer-to-peer network. It did remedy some of the flaws of Napster, in the sense that it did not have a central server. What we called a client or a peer there, we will call a host here. A Gnutella host joins the network by first contacting another Gnutella host. We are assuming that it knows of some other Gnutella host that it can contact in order to join. Once it has joined, it will send file search messages to its neighbors, and the neighbors will forward the messages to their neighbors. So in a sense, it will dynamically discover the network. If you consider a graph, first it sends a message to its neighbors, they forward it to their neighbors, those forward it to their neighbors, and so on.


So it will know the names of all the files that are there in its neighborhood. Of course, you do not want the entire internet to be swamped with messages. So every message has a TTL (time to live) field, and with every hop the TTL is decremented, so ultimately the message dies out. If the TTL is initially equal to k, this means that the message reaches all the nodes within a radius of k hops, and the entire list of files held by those nodes comes back.
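

The effect of the TTL can be seen with a small flooding simulation over a graph. This is a sketch under stated assumptions: the overlay is given as an adjacency dictionary, and each node forwards a given message at most once (duplicate suppression is an assumption of this example; the lecture does not say how repeats are handled).

```python
from collections import deque


def flood(graph, source, ttl):
    """Return the set of nodes reached by a message that starts at `source`
    with the given TTL, where each forwarding hop decrements the TTL."""
    reached = {source}
    queue = deque([(source, ttl)])
    while queue:
        node, remaining = queue.popleft()
        if remaining == 0:
            continue  # TTL exhausted: the message dies out here
        for neighbor in graph.get(node, ()):
            if neighbor not in reached:  # forward each message only once
                reached.add(neighbor)
                queue.append((neighbor, remaining - 1))
    return reached
```

Because the traversal is breadth-first, every node is first reached along a shortest path, so the result is exactly the set of nodes within `ttl` hops of the source.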


This is how a node used to initialize itself. Once the server replied, the client established a direct connection with it to download the file. The server in this case is a remote peer, to which a request went from the client asking: can I get this file from you? The remote peer would say yes if it had the file.


A quick overview. If this is the client, then search messages are sent everywhere within a radius of k hops, and if anybody has the results, the results are sent back.


(Refer Slide Time: 22:15) 


Gnutella: Overview


Entities of each Gnutella Peer


Connection Handler
Manages connections with other Gnutella peers.


Co-ordination Instance
Co-ordinates connections with other Gnutella peers.


Download Instance
Handles a download from a remote peer.


Upload Instance
Uploads a file to a remote peer.


Smruti R. Sarangi First and Second Generation Peer to Peer Networks 


So again, every Gnutella peer would have four of these entities: a connection handler to manage connections with other Gnutella peers, a coordination instance to coordinate connections with other Gnutella peers, and a download and an upload instance, as we have seen.


Now let us come to the details. Initially, the connection handler is in the offline state. It opens a coordination connection with another Gnutella peer and sends a connect message, which changes the state to online. Let us assume that it somehow knows the ID of a Gnutella peer, or that there were some openly visible servers which could at least connect you to Gnutella peers. Mind you, there was no legal liability there, because all that such a server did was connect you to another Gnutella peer. It did nothing more than that; it was not coordinating any transfer. So technically speaking, or legally speaking, connecting to Gnutella was not illegal, but transferring a file without a license was. That is the reason these lookup servers never participated in any information transfer as such.


In the online state, the client might ping other peers to find the size of the Gnutella network. When it wished to search for something, it entered the search state. What can happen is that initially there is some amount of network discovery: when you discover the network within k hops, each of those peers sends you the list of files that it has, or that it knows others have, and this creates a sort of localized database used for searching. Otherwise, you broadcast a search message, and of course every message has a time to live field, which means it ultimately dies out.


This was sent to a large part of the network. Once that was done, if any peer knew who had the file, or had it itself, it would return the search results. If the particular machine was not behind a firewall, the Gnutella protocol would start a connection to transfer the file. Otherwise, recall what we saw with Napster, where we had a remote client upload state. We had something very similar here, because again this was made for college students, who unfortunately are known to transfer files illegally without licenses.


So Gnutella brought in a similar mechanism, the client push request. Here again we have the remote peer, which in Gnutella terminology we also refer to as a server; Gnutella does not have a central server, but I am just going with the terminology that the Gnutella people use. If the remote peer is behind some sort of a firewall, or is within some internal network, then in the client push model the remote peer initiates the transfer and sends the file.


(Refer Slide Time: 25:34) 


Gnutella: Details


Requests in the Gnutella Protocol - II


The co-ordination instance can receive 5 types of messages.

- ping: The TTL will be decremented and the message will be forwarded to remote peers. The returning pong message will be sent back in the reverse order of peers to the client.
- pong: Update local database with information about remote peers. Send the pong message back to the original client.
- search: If the file exists locally, then return a search result message. Otherwise, decrement TTL and forward the message to peers.
- search-result: Update the search results and try to download the file from the remote peer.
- client push: Create a new connection to the remote peer, send the giv message, and transmit files through an upload instance.


So now let us come to the coordination instance, which can broadly receive five types of messages. The first two are pretty much network discovery messages: the ping message and the pong message. A ping is typically sent when a new client joins.


The client sends these ping messages all over its neighborhood with a time to live field. If this is k, then every time we traverse one hop, the TTL field is decremented and the message is forwarded to remote peers. When the TTL becomes zero, what comes back to the original client is a pong message, which is sent back in the reverse order of peers.


So how does it work? Ping, ping, ping, ping going out, and then pong, pong, pong, pong coming back. When the pong message comes back, we update the local database with information about all the remote peers. The remote peers along the path of the pong message can do the same among themselves: they read the pong and keep forwarding it towards the original client.
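

The "reverse order of peers" can be illustrated along a single path. In this sketch each peer remembers which neighbor the ping came from, and the pong simply retraces those hops. The function name and the `came_from` table are hypothetical, and the example follows only one path rather than the whole neighborhood.

```python
def route_ping(path, ttl):
    """`path` lists peers in forwarding order, starting with the original
    client. Returns the hops the pong visits on its way back."""
    came_from = {}  # peer -> neighbor it received this ping from
    hops = 0
    while hops < ttl and hops + 1 < len(path):
        came_from[path[hops + 1]] = path[hops]
        hops += 1
    # The peer where the ping stops answers with a pong, which retraces
    # the recorded hops in reverse order back to the client.
    trace = [path[hops]]
    while trace[-1] in came_from:
        trace.append(came_from[trace[-1]])
    return trace
```

No peer needs to know the address of the original client: each one only forwards the pong to the neighbor it got the ping from.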


So this helps clients within a small neighborhood, at least, share information with each other regarding which song is where. They maintain a localized database, because there is no central server. Then, let us say we know that a given song is at some IP address; we can establish a connection directly. Now let us assume that even after this network discovery, we want a song and we do not know where it is. So what do we do? We send a search message. At each peer, if the file exists locally, then of course the peer returns a search result message. Otherwise, the TTL field is decremented and the search message is forwarded.


The TTL field is always there to stop flooding in the network, and the message is forwarded to the peers. So if we have one client, we just keep on sending search, search, search, and so on. The moment some information is found, similar to ping and pong, what comes back is search result, search result, and so on. The search results come back in the reverse order with the address of the remote peer that has the file. We also occasionally refer to the remote peer as a server in Gnutella, but that is not all that common, even though we have done so once before.


Here again, we have the same idea: if a connection can be established with the remote peer, it is fine. Otherwise, the remote peer realizes from the request that the original client cannot establish a connection with it. What it does is create a new connection and transmit the files through an upload instance. So the client push is similar to the remote client upload that we had in Napster, when the remote peer was behind a university's internal firewall. In that case, the remote peer would establish a connection with the client and supply a copy of the file to it.
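

The per-peer rule for a search message is simple enough to write as one function. This is a minimal sketch of the rule on the slide; the function name, the action labels, and the choice to drop a message whose decremented TTL reaches zero are assumptions.

```python
def on_search(local_files, wanted, ttl):
    """Decide what a peer does with an incoming search message.

    Returns (action, new_ttl), where action is one of the hypothetical
    labels "search-result", "forward" or "drop"."""
    if wanted in local_files:
        # The file exists locally: answer instead of forwarding.
        return ("search-result", None)
    new_ttl = ttl - 1
    if new_ttl <= 0:
        # The message has used up its hops and dies out here.
        return ("drop", None)
    return ("forward", new_ttl)
```

A peer that answers does not forward the search any further in this sketch, which keeps traffic down at the cost of possibly missing other copies of the file.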


Traffic


Napster: Here also scalability is an issue because every time the client connects to a broker, it sends a list of all the files that it has.

Gnutella: Scalability is a big concern because of the exponential number of ping and pong messages.


Resiliency-wise, if I were to compare, Napster has a mechanism for finding the best broker and doing some centralized load balancing, in the sense that it ensures no broker is overloaded. These are the advantages you get from a centralized network, where you have more control. You also have a centralized directory for lookup; it is not really a search-based system. But here again, there is a massive amount of legal liability. Gnutella has reduced that to a large extent by being fully distributed. The only job of lookup servers in Gnutella is to connect a new node to a host. Henceforth there is network discovery as well as searches within the neighborhood.


Gnutella is furthermore resilient to network partitions, because if the network gets partitioned, the sub-network here does searches within itself and the sub-network there does searches within itself. So Gnutella, in that sense, is extremely flexible in the way that it operates. In terms of traffic, scalability is an issue for Napster, because whenever you have a centralized server, it sees lots and lots of requests. The requests swamp it, and because of that it becomes a performance bottleneck.


In Gnutella too, scalability is a big concern, but in a different manner. We send a lot of ping, pong, search and search result messages, and they are all broadcast within a radius of K starting from the client. If many search messages are being broadcast, a local network can fill up with them. This did not happen in Napster, where we just sent a single message to a centralized server and got a response. But in Gnutella this issue will arise: we will essentially be flooded by these discovery messages, just to find out who has what. So the two systems have different kinds of issues.


To summarize: Napster is less resilient and Gnutella is more resilient. Napster is not scalable because there are a lot of requests at the central server. Gnutella is not scalable because, for discovering who has what and also for the searches, a lot of messages need to be broadcast throughout the network. So they have their own pluses and minuses. But clearly, with Gnutella the legal liability is lower.


Hence, we call it a peer-to-peer network, because the only job of a Gnutella server is to locate a peer: when a new node joins, the server's only job is to find it a peer. Nobody maintains any usernames or passwords or anything of the sort. So this centralization is not there; it is truly a distributed system. Gnutella itself does not ask you to transfer illegal material; if two peers are doing so, they are mutually responsible, and nobody else is.
In terms of search quality, Napster needs to link all of its brokers across the network and maintain a centralized database. This is hard to do, but you get better results if you can do it. And of course, there are legal issues. In Gnutella, the ping, pong, search and search result messages have a limited radius, which can cause Gnutella to miss a lot of files.
So we have now discussed Napster and Gnutella. With Napster being, let us call it, gen one and Gnutella being gen two, we have seen the legal liability reduce. But this flooding of messages is clearly the issue with Gnutella. Furthermore, if two peers are exchanging some data, they are clearly aware of each other’s IP addresses. All the nodes along the route are also aware of who is searching for what and who gave what search result. So, in a sense, anonymity is not fully guaranteed.


So we will now see a few more technologies, such as Freenet, which was a precursor to the dark web and Tor networks, where anonymity is pretty much guaranteed: nobody knows what I am searching for and what I want. This is far more anonymous and secure. Again, these have been used to do wrong things and bad things, but from a computer science knowledge point of view you should know what they are. Then we will come to BitTorrent.


BitTorrent has, of course, seen lots and lots of commercial application, in the sense that most of our modern big systems, such as those at Amazon, Facebook, Google, etc., use key technology similar to what BitTorrent uses, and here there is no flooding of messages. It uses a beautiful thing called a distributed hash table. We will discuss that; this is pretty much the direction in which we are going. This is where the current lecture ends.
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Advanced Distributed Systems 
Professor Smruti R Sarangi 
Department of Computer Science and Engineering 
Indian Institute of Technology Delhi 
Lecture 03 
Pastry: A distributed hash table 


Welcome to the study of distributed hash tables. Distributed hash tables are by far the most commonly used data structures in distributed systems. Many commercial systems are built on top of distributed hash tables. Pastry is one of the most common designs in this space, and as we shall see later in the course, many designs are built on top of Pastry or a Pastry-like system.
• Results


Here is a brief overview of this slide set. We will start by describing what distributed hash tables are, and then we will move on to the specific operations of Pastry. We will discuss a broad overview of the operation of Pastry; the arrival, departure and locality of nodes; and finally the results.
Normal Hashtables

• Hashtable: Contains a set of key-value pairs. If the user supplies the key, the hashtable returns the value.

• Basic operations:

  insert(key, value): Inserts the key-value pair into the hashtable.
  lookup(key): Returns the value, or null if there is no value.
  delete(key): Deletes the key.
  Time complexity: Approximately Θ(1)

• Need a sophisticated hash function to map keys to unique locations.

• Need to resolve collisions through chaining or rehashing.


Before describing what a distributed hash table is, it is important to describe what a normal hash table is. Any student of data structures would have encountered this term before, so there is nothing new over here; a hash table in this context is the same as what we have over there. It is essentially a dictionary that contains a set of key-value pairs; that is the important operative term over here.


It is a set of key-value pairs where the user simply supplies the key and the hash table returns the value. As opposed to an array, where we index based on an integer, here we send the key and then read the value. A hash table supports several functions. The insert function, insert(key, value), inserts a key-value pair into the hash table. There is a lookup operation: for a given key, it either returns the value or returns null if there is no value.


There is a delete(key) operation, where we delete a key. All of these operations take an approximately constant, Θ(1), amount of time. The design of most hash tables requires a sophisticated hash function to map keys to unique locations and essentially reduce the chances of collisions. If there are collisions, there are two standard ways of dealing with them, known as chaining and rehashing.


Chaining means that every entry of the hash table is a linked list, and we traverse the linked list. Rehashing means that if one entry is full, we search for an alternative location. A distributed hash table is the same.
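The operations above can be sketched in a few lines of Python. This is only a toy illustration of chaining, not how any production hash table is built; the class name `ChainedHashTable` and the bucket count are assumptions made for the example, and Python lists stand in for the linked lists.

```python
class ChainedHashTable:
    """Toy hash table that resolves collisions by chaining."""

    def __init__(self, num_buckets=8):
        # Each bucket holds a chain of (key, value) pairs.
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        # The hash function maps the key to one bucket.
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:              # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))   # new key (or a collision): extend the chain

    def lookup(self, key):
        for k, v in self._bucket(key):  # traverse the chain
            if k == key:
                return v
        return None                     # "null" if there is no value

    def delete(self, key):
        bucket = self._bucket(key)
        bucket[:] = [(k, v) for k, v in bucket if k != key]


# Two buckets only, so collisions (and hence chains) are very likely.
table = ChainedHashTable(num_buckets=2)
table.insert("with", "avec")
table.insert("cat", "chat")
table.insert("dog", "chien")
print(table.lookup("with"))
table.delete("with")
print(table.lookup("with"))
```

With only two buckets, at least two of the three keys must share a chain, which is exactly the collision case that chaining handles.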
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Distributed Hashtable 


It is, of course, much bigger. How big? Consider all the movies that Netflix stores; it might be millions of movies. If millions of movies are to be stored, they cannot be stored on one machine. They have to be stored on thousands of machines, and these thousands of machines also have to be geographically distributed.


If they are not distributed and there is a power failure in one geographic location, the entire set of movies can become unavailable, and in this case your favorite provider, Netflix, will not be able to operate. Hence, a distributed hash table is not meant for a few entries, like ten, a hundred or a thousand; it is meant for a million or a billion entries.


Consider Amazon: just look at the sheer number of items that Amazon lists on its store. Suppose I want a quick search operation. Let us say I am interested in a coffee mug: if I type "coffee mug" and would like to see all the matching entries, then I need a hash table of sorts. That is why this forms the key search structure in almost all major cloud-based systems.


That includes Amazon, Netflix, a large part of Google, LinkedIn and Facebook, as we shall see in the third part of the course. In a hash table we have three operations: lookup, insert and delete. We will see, as a recurring theme, that we typically organize the nodes that store the key-value pairs in a circular format.
This is essentially a virtual network, and a virtual network is also called an overlay. So, regardless of how the real network is laid out, we organize these nodes in a virtual circular form known as an overlay. We will see that this makes our job of locating where a key is substantially easier.
Why do we say that we need DHTs? As we have discussed, we need DHTs to store web scale data, which means a lot of data. Let us consider an example. Assume a bank has 10 crore customers. The crore is an Indian term; a crore is 10 million, so this is 100 million customers. Let us assume that the storage space required for each customer's bank account is the same as the size of this LaTeX file.


Incidentally, this entire presentation was made using LaTeX (strictly speaking, the name should be written as LaTeX rather than latex, which is the more correct form). What is LaTeX? It is a wonderful typesetting system. I have used the beamer package, and I would encourage all of you to take a look at it. I also took a look at my bank accounts: I took all the transactions and put them in a text file.


The size of all of that was roughly the size of this file. This file is around 8 kilobytes, and 8 kilobytes times 0.1 billion entries is 0.8 terabytes. As you can see, this is not much: a regular desktop, or even a combination of pen drives, can provide this much space. Definitely one desktop, or even one modern laptop these days, can provide 0.8 terabytes of space.
Incidentally, the laptop that I am using to record this video has half a terabyte of space. This means that if I have 100 million customers in a bank, which is not a small number by any means, the total data requirement is not much, as opposed to web scale data, where I would need much, much more. That is the reason why, for a traditional banking application, I can use a traditional database, but for a much larger source of data I would need a DHT.
• They can store more data than centralized databases.
• DHTs are the only feasible options for web-scale data: Facebook, LinkedIn, Google
• Assume that a bank has 10 crore customers (0.1 billion)
• Each customer requires storage equivalent to the size of this LaTeX file
• Total storage requirement: 8 KB × 0.1 billion = 0.8 TB

• A user is sharing 100 songs: 500 MB/user
• There are 10 crore (0.1 billion) users
• Storage: 50 PB (petabytes)

There is a difference of several orders of magnitude!


So how large is web scale data? Let us say a user shares 100 songs on average. Assuming the MP3 format, that is 500 megabytes of space per user. If there are again 0.1 billion users, it is easy to do the math: we will need 50 petabytes of storage, which is way more than the 0.8 terabytes of the bank. It is also not practical to have that much storage at a single location.
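The two estimates are easy to check with a few lines of Python. The sketch below assumes decimal units (1 KB = 1000 bytes and so on), which is what makes the figures on the slide come out exactly; the variable names are mine.

```python
# Decimal storage units, in bytes.
KB, MB, TB, PB = 10**3, 10**6, 10**12, 10**15

users = 10**8  # 10 crore = 100 million = 0.1 billion

bank_bytes = 8 * KB * users     # 8 KB of account data per customer
songs_bytes = 500 * MB * users  # 100 songs, about 500 MB per user

print(bank_bytes / TB)           # size of the bank data in terabytes
print(songs_bytes / PB)          # size of the song data in petabytes
print(songs_bytes // bank_bytes) # how many times larger the song data is
```

The last line shows that the song collection is 62,500 times larger than the bank data, so the gap is between four and five orders of magnitude.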


First, it is a huge amount of storage, so the operating costs and the cooling costs will be very high. Second, if there is a power outage, the entire system is gone. So we would like to have a large system consisting of a large number of geographically distributed machines that together provide this much storage. Even with today's technology, this is a lot of storage.


So the main aim is that, for web scale data, we provide tons of storage, and this storage is in a certain sense used to provide web scale services. The difference between the two examples is clearly several orders of magnitude (50 PB is 62,500 times 0.8 TB), and this is something we need to bear in mind.
Advantages of DHTs

• DHTs scale, and are ideal candidates for web scale storage.
• They are more immune to node failures. They use extensive data replication.
• DHTs also scale in terms of the number of users. Different users are redirected to different nodes based on their keys (better load balancing).
• In the case of Torrent applications: they reduce the legal liability since there is no dedicated central server.

Major Proposals

Pastry, Chord, Tapestry, CAN, FAWN


So DHTs scale; they are ideal candidates for web scale storage. Furthermore, they are immune to node failures, in the sense that if there is an outage in one region, it is okay. This happens all the time: servers have outages all the time, but we never really see Google, Amazon, Netflix, etc. go down, primarily because they rely on technology that is immune to at least localized node failures.


Furthermore, they scale in terms of the number of users. Here is one fun fact: in the U.S., if I were to draw a graph of shopping over the year, there would be two peaks, one around Thanksgiving and the other around Christmas. That is when most of the shopping happens, and the rest of the time the amount of shopping is not much.


In India, most of the shopping happens right now, when I am recording this video, in a roughly three to four week window. It starts with the festival of Dussehra and pretty much ends with Diwali. This is the window in which most of the shopping happens in most of India.


In this time frame, if I am an ecommerce provider, I need a massive amount of compute and storage, but not for the rest of the time. So what we want is a system that is in a sense fluid: whenever I want to add new servers, it should be very easy to add them, and once the rush is gone, I can remove them.
For example, Thanksgiving falls in late November and Christmas is on December 25th, whereas Dussehra and Diwali fall roughly in the September to November time frame. If I am running a global ecommerce system, this is very convenient: I can add some servers during Dussehra and Diwali and then remove a few of them, add them again during Thanksgiving and then remove a few of them, and add them again during Christmas.


So what we need is some amount of flexibility in adding such servers. Furthermore, many of you would have used BitTorrent-like applications. BitTorrent is great technology, and it led to DHT technology maturing; sadly, many users do put illegal videos on torrent servers.


Having a DHT does reduce the legal liability, but that was never the intended use. And just a word of caution: please, please do not access, download or post illegal videos, that is, videos for which you do not have a license, on any kind of BitTorrent system. BitTorrent does use a DHT, that is how it works, and it is another great example of a DHT, but it should not be used to encourage piracy, because piracy kills the movie industry.


It kills the music industry as well, and since all of us like music and films, we would like musicians and filmmakers to be paid. So what are some of the major proposals in the world of DHTs? We have Pastry; Chord; Tapestry, which is similar, though not identical, to what BitTorrent uses; CAN; and FAWN. These are some of the major DHT-related proposals.
Salient Points

• Scalable distributed object location service.
• Uses a ring based overlay network.
• The overlay network takes into account network locality.
• Automatically adapts to the arrival, departure, and failure of nodes.
• Pastry has been used as the substrate to make large storage services (PAST) and scalable publish/subscribe systems (SCRIBE).
  • PAST is a large scale file storage service.
  • SCRIBE stores a massive number of topics in the DHT. When a topic changes, the list of subscribers is notified.


Here is a brief overview of Pastry. Pastry is a scalable distributed object location service where we store objects in a key-value format in a DHT. We use a ring-based overlay network. An overlay is a network over a network; it is a virtual network. Regardless of where the servers physically are, we just assume they are logically arranged as a ring.


This network does take locality into account, in the sense that some amount of locality between the nodes is taken care of; we will see how. Locality here means physical proximity. Some of it is taken care of, but it is important to understand how. Furthermore, this is a very flexible ring, in the sense that it can shrink and it can grow: if we add new servers, the ring grows and can become much bigger.


So it is rather flexible in that sense: it adapts to the arrival, departure and failure of nodes. Pastry has been used by some popular services such as PAST and SCRIBE. PAST is a large scale file storage service, and SCRIBE is a system where we publish and subscribe to topics. As I said, Pastry was one of the earliest DHTs, and it has influenced the development of DHTs quite a bit, in a very positive way.
Design of Pastry

• Each node has a unique 128-bit nodeId.
• The nodes are conceptually organized as a ring, arranged in ascending order of nodeIds.
• nodeIds are generated by computing a cryptographic hash of the node’s IP address or public key.
• Basic idea of routing:
  • Given a key, find its 128-bit hash.
  • Find the node whose nodeId is numerically closest to the hash value.
  • Send the request to that node.


Let us now, without further ado, discuss the design of Pastry. We assume that every node is associated with a 128-bit node ID. The node ID can be generated as follows. We can, for example, take the IP address of the node and apply a certain transformation to it. For example, we can hash the IP address using an SHA algorithm (strictly speaking this is hashing, not encryption), which gives us a fixed-length digest.


From this digest we can extract the lower 16 bytes, which is 128 bits. This is how we can give every node a 128-bit node ID that is unique with very high probability. Now the aim is to conceptually organize these nodes as a ring, where we assume that they are arranged in ascending order of node IDs.
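As a rough sketch of this step, the snippet below derives a 128-bit node ID from an IP address using Python's standard `hashlib`. The lecture only says "the SHA algorithm", so the choice of SHA-256, the decision to treat the last 16 bytes of the digest as the "lower" bytes, the function name `node_id` and the example address are all assumptions made for illustration.

```python
import hashlib


def node_id(identifier: str) -> int:
    """Map an identifier (e.g. an IP address) to a 128-bit integer ID."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()  # 32 bytes
    low16 = digest[-16:]                 # keep 16 bytes = 128 bits (assumed "lower" bytes)
    return int.from_bytes(low16, "big")  # interpret them as one big number


nid = node_id("192.0.2.10")   # documentation-range IP, purely an example
print(f"{nid:032x}")          # the ID written as 32 hex digits
```

The same function can be applied to a key, which is what lets keys and node IDs live in the same 128-bit space.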


Let us say we walk clockwise around the ring. Starting from some node, we encounter the nodes in ascending order of node ID. It is a conceptual ring, so there will be exactly one point of discontinuity, where we wrap around from the largest node ID to the smallest; we are okay with that. Everywhere else the nodes are arranged in ascending order.


So node IDs are generated by computing a cryptographic hash of the node's IP address; alternatively, the node ID can be derived from the node's public key. So what is the basic idea of routing here? The design has already been described: we have a set of nodes, each node has a 128-bit node ID, and the nodes are organized conceptually as a ring.
We can assume, without loss of generality, that the node IDs increase as we walk clockwise. So how do we locate a key here? For the key we do the same thing: we compute a 128-bit hash. On the ring that I have drawn, the marks are the nodes. We need to find the node ID that is numerically closest to the hash value of the key.


So we need to walk this ring and find the node ID that is numerically closest to the hash value of the key. Let us say it is this node. Then the node with this ID is automatically responsible for storing the key-value pair of this key.


So what is the broad idea? Just think about it; this is the core idea of a DHT. I will go over it several times, because it is important for the viewers of this video to get the idea. We have a circular ring of nodes. It is a conceptual ring, not a physical ring, in which we assume that the nodes are organized in increasing order of their node IDs. Now, given a key, we need to find the node that contains its value.


From the key we create a 128-bit ID, and this number is used to find the node ID within the ring that is numerically closest to it, that is, the one that minimizes the absolute value of the difference. Let it be this node. We send a request to this node and provide the key, and it provides us the value. As simple as that, nothing more, nothing less. This is the simple idea of a DHT, and almost all major commercial systems use some variation or modification of it.
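The core idea can be written down as a brute-force search over all node IDs. This is only a sketch with a global view of the ring, which a real node does not have; Pastry's actual routing reaches the same node without knowing every node. The helper names `to_id` and `responsible_node`, the use of SHA-256, and the example IP addresses are assumptions, and the sketch uses the plain absolute difference described above without handling wraparound at the ends of the ID space.

```python
import hashlib


def to_id(text: str) -> int:
    """Hash a string (key or node identifier) to a 128-bit integer."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return int.from_bytes(digest[-16:], "big")


def responsible_node(key: str, node_ids: list) -> int:
    """Return the node ID that is numerically closest to the key's hash."""
    key_id = to_id(key)
    return min(node_ids, key=lambda nid: abs(nid - key_id))


# Eight hypothetical nodes, arranged in ascending order of node ID.
nodes = sorted(to_id(f"10.0.0.{i}") for i in range(1, 9))
owner = responsible_node("with", nodes)
print(f"key  {to_id('with'):032x}")
print(f"node {owner:032x}")
```

The node printed on the second line is the one that would store the pair ('with', 'avec') in the dictionary example used later in the lecture.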
Advantages of Pastry

• Pastry can route a request to the right node in less than $\log_{2^b}(N)$ steps. $b$ is typically 4.
• Eventual delivery is guaranteed unless $L/2$ nodes with adjacent nodeIds fail simultaneously. $L$ = 16 or 32.
• Both nodeIds and keys are thought to be base $2^b$ numbers. If we assume that $b = 4$, then these are hexadecimal numbers.


So what is the advantage of Pastry? Why is Pastry great? Let us say we have $N$ nodes. What Pastry claims is that it can route a message in fewer than $\log_{2^b} N$ steps. I will come to what $b$ is, but for now let us assume that $2^b = 16$. I will discuss this later in great detail, but in fewer than $\log_{16} N$ steps Pastry can route the message, meaning that the key can be sent to the node with the matching node ID.


We will discuss in detail what a step is, but let us say a step is sending one message. So with fewer than $\log_{16} N$ messages, we can find the correct node and send a message to it. That message says: this is the key, what is the value? Instead of 16 the base can be 8 or 32, but 16 is most commonly used. Because the base is a power of 2, I have written it as $2^b$.
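To get a feel for how small this bound is, here is a quick calculation for a few network sizes. The function name `hop_bound` is mine, and the network sizes are arbitrary examples, not figures from the lecture.

```python
import math


def hop_bound(num_nodes: int, b: int = 4) -> float:
    """Return log of num_nodes to the base 2**b (base 16 when b = 4)."""
    return math.log(num_nodes, 2 ** b)


for n in (1_000, 1_000_000, 1_000_000_000):
    print(f"N = {n:>13,}  log16(N) = {hop_bound(n):.2f}")
```

Even for a billion nodes the bound is under 8 messages, which is why prefix-based routing scales so well.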


Since $b$ is typically 4, $2^b = 16$. Pastry also does a wonderful job of tolerating node failures. Eventual delivery at the destination is guaranteed unless $L/2$ nodes with adjacent node IDs fail simultaneously, where $L$ is a user-defined constant that can be 16 or 32. What this is saying is this: suppose this is the final node, and there are many node IDs at different points of the ring around it.


Only if $L/2$ nodes with adjacent node IDs fail can we say that we might not reach the destination. If it is never the case that $L/2$ adjacent nodes fail, then we will always reach the destination. Why is this the case? Not now; I will tell you later, but this can be remembered.




This is a point worth remembering: there is a certain amount of immunity to node failures. Even if nodes fail, we can still guarantee delivery at the destination, assuming of course that the destination itself has not failed and that $L/2$ nodes with adjacent node IDs have not failed simultaneously. One important point: both node IDs and keys have the same base. They need not, but there are many advantages if they do.


So both are assumed to be base $2^b$ numbers. $b$ is typically 4, so we will assume that $2^b = 16$.
Basic Operation

• In each step, the request is forwarded to a node whose shared prefix with the key is at least 1 digit ($b$ bits) longer than the length of the prefix that the key shares with the current node's nodeId.
• If no such node is found, then the request is forwarded to a node that shares a prefix of the same length but is numerically closer to the key.


So what is the basic operation? In each step, the request is forwarded to a node whose shared prefix with the key is at least one digit more than the length of the prefix that the key shares with the current node’s node ID. This is a very loaded statement, so I suggest that we go slowly. Remember it for now; I will keep revisiting it until the idea is clear.


Let me show this. This is the key idea, and I suggest that this slide, which is slide number 11, be viewed several times until the idea is completely understood. We have nodes in the node ID space, at the points that I am drawing. They need not be uniformly placed; they can be non-uniformly placed as well, given that the node IDs come from a random process.


Let us say I have a key, and I want to figure out the IP address of the node within this ring that contains the key and consequently its value. Let us say that this is an English to French dictionary, the word that I want to search for is ‘with’, and the French for ‘with’ is ‘avec’.


So I want to find out where ‘with’ is in this entire DHT. What is my key here? It is ‘with’. I run it through an ID generator, which, as I said, is typically an SHA algorithm. It gives me a 128-bit, or 16-byte, number. Then I go to any node that I know to be a part of the ring and give it this key.


Let me now divide this into hex digits. What are hex digits? Hex digits are hexadecimal digits, and each hex digit is 4 binary digits (bits). This is important: 1 hex digit is 4 bits. If you are not able to understand why this is the case, then pick up a book on Boolean logic, first figure out why this is the case, and then proceed.


If you do not understand this, you will not be able to understand what happens next. So 1 hex digit is 4 bits; I need to break the hash down into a sequence of hex digits and search for it. Let me now shift to another piece of software, where I can show this process in great detail.
So what did I say? I said that 1 hex digit is 4 bits, and that needs to be understood. Given that I have a 128-bit hash, I can divide it into 4-bit numbers. How many such 4-bit numbers will I have? I will have 128 / 4 = 32 of them, and each one of them is a hex digit.
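The split into digits is just shifting and masking. The sketch below is my own illustration; the function name `hex_digits` and the example ID (whose leading digits F, E, 3 are chosen to match the example that follows) are assumptions.

```python
def hex_digits(value: int, bits: int = 128, b: int = 4) -> list:
    """Split a `bits`-bit number into digits of `b` bits each, most significant first."""
    count = bits // b        # 128 / 4 = 32 digits
    mask = (1 << b) - 1      # 0b1111 when b = 4
    return [(value >> (b * (count - 1 - i))) & mask for i in range(count)]


# Hypothetical 128-bit ID whose three leading hex digits are F, E, 3.
example_id = 0xFE3 << 116
digits = hex_digits(example_id)

print(len(digits))                       # number of hex digits
print([f"{d:X}" for d in digits[:3]])    # the three most significant digits
```

Each list entry is a value from 0 to 15, that is, one hex digit made of 4 bits.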


What was my key? It was the English word ‘with’. Let us assume that the first few digits of its hash are as shown; these are all hex digits, by the way. Then there are a few more digits that I initially do not care about. Given that this is the ring, I would know some node within the ring. Let us say this is the starting node.


I send a message to that node and tell it: look, this is my key (of course, I send a sequence of 32 hex digits); kindly find the node that contains the key. What that node does is this: it checks its own node ID, whatever that may be.


It matches, hex digit by hex digit, the key against its own node ID. In this case, it finds that the first digit matches, which is a good thing. Then it takes a look at the next digit. The next digit in the key is ‘E’, and in the node ID it is ‘A’.
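This digit-by-digit comparison is simply the length of the common prefix of two hex strings. In the sketch below, IDs are shortened to four hex digits for readability, and the concrete values are hypothetical ones chosen to mirror the example (a key starting with F, E and a node starting with F, A).

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Count how many leading hex digits two IDs have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


key = "fe31"     # hypothetical leading digits of the hash of 'with'
node1 = "fa27"   # hypothetical node ID of the first node we contact

print(shared_prefix_len(key, node1))  # only the first digit, 'f', matches
```

The result tells the node how many digits are already matched, and therefore which digit it needs a better match for.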


This is not desirable, because we would like to have a match. Given that these two digits do not match, the node tries to find another node whose first two digits match the key. Let this node be node 1. It searches its own data structures; the data structure that it searches is called a routing table.

In the routing table it tries to find any other node for which the first two digits match, in the sense that the first two digits are ‘F’ and ‘E’. Let us assume it is able to find one such node. We had initially approached node 1; it now forwards the message to this new node, which is one message forwarding.


Let us call this node 2. With node 2, the first two digits, ‘F’ and ‘E’, match the key, but the remaining digits of the node ID do not. We then search for the next match: node 2 is asked, look, your first two digits match the key; do you know of another node whose first three digits match?
Let us assume it does. In this case node 2 further forwards the message to node 3, which can have a node ID of the form ‘F’, ‘E’, ‘3’ and then maybe ‘6’. We have now increased the match: the first three hex digits match and the fourth one, of course, does not. We then proceed in a similar manner.


So this was node 1; it forwarded the message to node 2, where more digits match; node 2 forwarded it to node 3, where again more digits match; node 3 will forward it to another node, and so on. Ultimately we will have a fair number of matching digits.
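The walk from node 1 to node 2 to node 3 can be simulated with a small sketch. It implements only the two forwarding rules stated on the Basic Operation slide, using four-digit IDs and a made-up table of which nodes each node knows about. Real Pastry routing tables and leaf sets are structured differently, so treat the names `next_hop`, `route` and `tables`, and all the IDs, as assumptions for illustration.

```python
from typing import Dict, List, Optional


def shared_prefix_len(a: str, b: str) -> int:
    """Count how many leading hex digits two IDs have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def next_hop(current: str, key: str, known: List[str]) -> Optional[str]:
    """Apply the two forwarding rules; return None if neither applies."""
    cur_len = shared_prefix_len(current, key)
    # Rule 1: prefer a known node that shares a longer prefix with the key.
    longer = [n for n in known if shared_prefix_len(n, key) > cur_len]
    if longer:
        return max(longer, key=lambda n: shared_prefix_len(n, key))
    # Rule 2: otherwise, same prefix length but numerically closer to the key.
    key_val = int(key, 16)
    cur_dist = abs(int(current, 16) - key_val)
    closer = [n for n in known
              if shared_prefix_len(n, key) == cur_len
              and abs(int(n, 16) - key_val) < cur_dist]
    if closer:
        return min(closer, key=lambda n: abs(int(n, 16) - key_val))
    return None


def route(start: str, key: str, tables: Dict[str, List[str]]) -> List[str]:
    """Follow next_hop from `start` until no better node is known."""
    path = [start]
    while True:
        hop = next_hop(path[-1], key, tables.get(path[-1], []))
        if hop is None:
            return path
        path.append(hop)


# Hypothetical view of which nodes each node knows about.
tables = {
    "fa27": ["fe90", "1b44"],   # node 1
    "fe90": ["fe36", "fa27"],   # node 2
    "fe36": ["fe30"],           # node 3
    "fe30": [],
}
print(route("fa27", "fe31", tables))
```

Every hop either lengthens the matched prefix or, failing that, moves numerically closer to the key, so the walk cannot loop and must stop.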


Given that the ID has 128 bits, which is a very large number space, this process will terminate at some point. We will see mathematically when and where, but let us assume that a certain number of digits match and after that we are not able to find a longer match. We will then discuss one more algorithm, which searches the neighborhood to find the node that contains the key. But just look at the way that we are constructing these matches.


So we have started from left to right, so at least most of the MSB bits, the Most Significant 
Bits, they are matching, so which is kind of bringing us closer and which is bringing the node 
ID and the key in a certain sense closer and closer and closer and ultimately the target node 
and the key they will sort of start falling in the same neighborhood and then we will apply a 


different kind of search and ultimately find the matching node. 


So recall, what is the matching node? Well, the matching node is the node whose node ID is
numerically the closest to the key. As we match these MSB hex digits, we get closer and closer,
till the node ID and the key start lying in the same neighborhood; then we can apply a different
algorithm and perform the match. So what is the key broad idea?
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Well, the broad idea, let me summarize this. This is something that should be summarized. So, 
if I were to write the main points: first, we take a node and convert the node's identifier,
whatever it is, like the IP address, to a 128-bit node ID. The exact width is not that important in
this context; since we are looking at a base 16 representation, we would like to look at this as
32 hex digits. And what is the relationship between a binary bit and a hex digit?


Well, if you do not know, then you should not be taking this course; first recapitulate Boolean
algebra and then come back. The second is the key. We do the same with the key: it also has
32 hex digits. Then we start from the left, as we did, and match digit by digit. Well, how do we
match?


Well, the way that we match is this: we start from the left and match the digits. If at a given
node only two digits match the key, then to make the third digit also match, that node, N1,
forwards the message to a node N2 that increases the amount of matching. N2 then forwards it
to N3, and so on and so forth, till we reach a neighborhood, and there, as I said, we apply a
different kind of search.


And ultimately we find the node that minimizes the distance to the key. Furthermore, our claim
is that this is done in less than about $\log_{16} N$ steps or messages. Even if N were a very
large number, let us say N is 1 million, let us compute $\log_{16}(1{,}000{,}000)$. Let us
approximate that as $\log_{16}(1024^2)$.

1024 is 2 raised to the power 10, so $1024^2$ is 2 raised to the power 20, which is $16^5$. So
$\log_{16}(16^5) = 5$. What we see is that even if we have a very, very large space of nodes,
the routing can proceed very quickly. The reason is that we reduce the search space
exponentially in each step.
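
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name `prefix_hops` is made up for illustration, and it counts how many base-16 digits are needed before the number of distinct prefixes reaches the number of nodes, which is the estimate used in the lecture.

```python
def prefix_hops(num_nodes: int, base: int = 16) -> int:
    """Smallest h such that base**h >= num_nodes (integer form of ceil(log_base N))."""
    hops = 0
    reach = 1  # number of distinct prefixes of length `hops`
    while reach < num_nodes:
        reach *= base
        hops += 1
    return hops


if __name__ == "__main__":
    for n in (1_000, 1_000_000, 10**9):
        print(n, prefix_hops(n))
```

Integer multiplication is used instead of a floating-point logarithm so that exact powers of 16 do not get rounded up by one.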


(Refer Slide Time: 38:31) 


Outline


@ Pastry


@ Operation


Pastry Node


Structure of a Pastry Node


It contains a routing table, neighborhood table, and a leaf set


Structure of the routing table

[Figure: routing table with 32 rows and 15 columns; each cell holds the IP address of a node]


Let us now discuss the key data structures in the Pastry protocol. The most important data
structure is the routing table. In this case, where we are considering a base 16 system, it will
have 32 rows, as we see over here, and 15 columns. So why 15 columns and 32 rows, if it is not
self-evident by now? Well, if we consider one hex digit, it can take 16 values, so apart from the
node's own digit there are 15 other possible choices.
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So let us consider the first digit, say the first digit is let us say A, then well, if the first digit 
matches that is good, otherwise if it does not match then we have 15 possible choices over here 
and we have one column for each choice. So this is why we have 15 columns in each row. 


Second, given the fact that one hex digit is 4 binary bits. 


Again, the viewer should be in a position to understand why this is the case; if not, go back,
study this and come back. Since there are a total of 128 bits and one digit is 4 binary bits,
128/4 = 32. So we will have 32 hex digits in each ID, either the node ID or the key ID. We can
have anywhere from 0 digits matching to 31 digits matching, and hence we will have 32 rows.
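
To make the 128-bit / 32-hex-digit relationship concrete, here is a small Python sketch. The lecture does not say which hash function produces the node ID; MD5 is used here purely as an assumption because its output happens to be 128 bits. The names `node_id_hex` and `shared_prefix_len` are illustrative.

```python
import hashlib


def node_id_hex(identifier: str) -> str:
    """Map an identifier (e.g. an IP address) to a 128-bit ID written as 32 hex digits.

    MD5 is an assumption made for this sketch; any 128-bit hash would do.
    """
    return hashlib.md5(identifier.encode("utf-8")).hexdigest()


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading hex digits that two IDs have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


if __name__ == "__main__":
    nid = node_id_hex("10.0.0.1")
    print(nid, len(nid))  # 32 hex digits = 128 bits
    print(shared_prefix_len("fe36", "fe3a"))
```

With 32 digits, the shared prefix between two different IDs is between 0 and 31 digits long, which is where the 32 rows come from.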
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Routing Table (R) 


@ The routing table contains 32 rows (0 ... 31).


@ The entries at row $i$ (count starts from 1) point to nodes that
share the first $(i - 1)$ digits of the prefix with the key.


@ Each row contains $2^b - 1$ columns.
@ Each cell refers to a base $2^b$ digit.


@ If the digit associated with a cell matches the $i^{th}$ digit of the
key, then we have a node that matches the key with a longer
prefix.


@ We should route the request to that node. 


So the routing table thus contains 32 rows; let us number them from 1 to 32. The entries at
row i, where the count starts from 1, point to nodes that share the first $(i - 1)$ digits of the
prefix with the key.


For example, in the first row, the first digit itself does not match, so the number of matching
digits is 0. We then try to match the first digit: we find the cell for the first digit of the key and
go to the IP address stored there, and so on and so forth. As has been explained, each row
contains $2^b - 1$ columns; if b = 4, this is 15. The matching process has also been explained.
So the only correction in this slide is that instead of 0 to 31 it will be 1 to 32.
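
The row and column indexing described above can be sketched as a small Python class. This is a simplified illustration, not the Pastry implementation: the class and method names are hypothetical, rows are numbered from 0 (so row p here is row p + 1 in the lecture), and each row keeps 16 slots with the node's own digit simply left unused instead of storing exactly 15 columns.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading hex digits that two IDs have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


class RoutingTable:
    """Prefix routing table for hex IDs (illustrative sketch)."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        # One row per possible shared-prefix length, one slot per hex digit value.
        self.rows = [[None] * 16 for _ in node_id]

    def add(self, other_id: str, address: str) -> None:
        """Record a known node in the row given by its shared prefix length."""
        p = shared_prefix_len(self.node_id, other_id)
        if p == len(self.node_id):
            return  # same ID as this node, nothing to store
        col = int(other_id[p], 16)
        if self.rows[p][col] is None:
            self.rows[p][col] = (other_id, address)

    def next_hop(self, key: str):
        """Entry that matches the key in one more digit, or None if unknown."""
        p = shared_prefix_len(self.node_id, key)
        if p == len(self.node_id):
            return None
        return self.rows[p][int(key[p], 16)]


if __name__ == "__main__":
    rt = RoutingTable("a1b2")  # 4-digit IDs keep the example small
    rt.add("fe36", "10.0.0.2")
    rt.add("a7c0", "10.0.0.3")
    print(rt.next_hop("fe99"), rt.next_hop("a7ff"), rt.next_hop("a1c0"))
```

With real 32-digit IDs the same class would have 32 rows.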


(Refer Slide Time: 42:26) 




Some Maths


@ The probability that two hashes have the first $m$ digits in common
is $16^{-m}$. Let us assume $2^b = 16$.

@ The probability that a key does not have its first $m$ digits
matching with the first $m$ digits of a node: $1 - 16^{-m}$

@ The probability that the key does not have a prefix match
of length $m$ with all of the nodes: $(1 - 16^{-m})^n$

@ Let $m = c + \log_{16}(n)$

@ We have:

$$\text{prob} = (1 - 16^{-m})^n = (1 - 16^{-c - \log_{16}(n)})^n = (1 - 16^{-c}/n)^n \qquad (\lambda = 16^{-c})$$


So let us do a little bit of math. It is important to do this math to understand how quickly the
Pastry protocol will actually converge. Let us find the probability that two hashes have
the first m digits in common. Well, the probability that the digits at one position are the same
is 1/16 or $16^{-1}$. This follows from the fact that we are using a base 16 system.


So the probability that the first m digits are common is $16^{-m}$. The probability that a key does
not have its first m digits matching with the first m digits of a node is $(1 - 16^{-m})$.
The probability that the key does not have a prefix match of length m with any of the nodes in
the system, assuming there are n nodes in the system, is $(1 - 16^{-m})^n$.

Given that these are independent events, the probabilities get multiplied, so it is this number to
the power n. So what should m be? Well, let us set $m = c + \log_{16}(n)$. The logic of doing
this will be clear very soon when we describe the rest of the mathematical expression, but let us
assume that this is how we set it. If I were to explain it in order notation, $m = O(\log(n))$.


But let this be the exact expression for m; we will discuss the importance of the constant
c later. So the probability is $(1 - 16^{-m})^n = (1 - 16^{-c - \log_{16}(n)})^n$. This is
$(1 - 16^{-c}/n)^n$. Let the constant $\lambda = 16^{-c}$.


Given that $\lambda$ represents $16^{-c}$, we can simplify this expression quite a bit. The term
$16^{-c}/n$ becomes $\lambda/n$, and the expression can be rewritten as
$(1 - \lambda/n)^{(n/\lambda) \cdot \lambda}$, because a product in the exponent is a power of a
power. The inner expression $(1 - \lambda/n)^{n/\lambda}$ is one that we have seen before, for n
tending to infinity or n becoming a very large number.

This is a very standard reduction: as n becomes very large, $(1 - \lambda/n)^{n/\lambda}$ can be
approximated as $e^{-1}$, and $e^{-1}$ to the power lambda is $e^{-\lambda}$. This is a known
identity and an important point to keep in mind. So what is this probability again?


Well, this is the probability that I have a key, I am interested in the first m digits of the key,
and the first m digits of the key do not match with any node of the system; this
$P = e^{-\lambda}$. So let us graph this.

@ As c becomes larger (let's say 5), $\lambda = 16^{-c}$ becomes very
small. Example: $16^{-5} = 9.5 \times 10^{-7}$.


@ Essentially: $\lambda \to 0$


So look at lambda. Well, what is lambda? It is $16^{-c} = 1/16^c$. As c becomes larger, let us
say c becomes 5, $\lambda = 16^{-5}$, which means lambda itself becomes very small:
$16^{-5} = 9.5 \times 10^{-7}$.


So what I can do is that this probability that has been computed, this can be computed as a 
function of c. Essentially this probability is a function of c because lambda is a function of c. 
So if this is a function of c what I can do is I can have c on the x axis increasing, of course, and 


I can have probability on the y axis. Again you will find it increasing. 


What you get to see is that as we increase the value of c, the probability starts at around
0.37 (for c = 0) and gradually saturates to 1 by the time that c = 2; by 3 definitely, but c = 2
is good enough. So what this tells us is: if c = 2, then what will m be equal to?
$m = 2 + \log_{16} N$. The reason n does not appear in the probability itself is that we are
considering large n.


So if you are considering large n, this expression essentially says that the moment m reaches
$2 + \log_{16} N$, the first m digits of the key will not match with any node, and the
probability of that becomes pretty much equal to 1. So it becomes almost certain that you will
not have a match with any node and that your routing tables will be of no further use.
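
The behavior of this probability as a function of c can be tabulated with a short Python sketch. The function names are illustrative; `prob_no_match` uses the large-n approximation $e^{-\lambda}$ with $\lambda = 16^{-c}$, and `prob_no_match_exact` evaluates $(1 - 16^{-c}/n)^n$ directly for a finite n so the two can be compared.

```python
import math


def prob_no_match(c: float, base: int = 16) -> float:
    """Large-n approximation e^(-lambda), with lambda = base^(-c)."""
    lam = base ** (-c)
    return math.exp(-lam)


def prob_no_match_exact(c: float, n: int, base: int = 16) -> float:
    """Finite-n expression (1 - base^(-c) / n)^n from the derivation."""
    return (1 - base ** (-c) / n) ** n


if __name__ == "__main__":
    for c in range(0, 4):
        approx = prob_no_match(c)
        exact = prob_no_match_exact(c, 1_000_000)
        print(f"c={c}  approx={approx:.6f}  exact(n=1e6)={exact:.6f}")
```

For c = 0 the value is $e^{-1}$, and by c = 2 it is already above 0.99, which is why c = 2 is treated as good enough.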


So what is the key implication? Well, the key implication is that the number of routing tables 
that we actually need to jump, so basically what happens? So we have one node and then a 
message is sent to another node to increase the match, then again to another node, again to 


another node, and in every stage we increase the matches by one. 


What this result says concerns the number of messages we need to send between the nodes.
A key arrives at one node, let us say this node, and then it is iteratively sent across the nodes,
with the match increasing by one digit at each node. At the most we will send it m times, and
what we will find is that after approximately $2 + \log_{16} N$ steps our number of matches is
not increasing.


In other words, if I consider this number $2 + \log_{16} N$, I will need approximately this many
steps in my protocol to arrive at a node beyond which I cannot route to other nodes by prefix,
because the match in terms of digits will not increase. I might have to do something else to find
the numerically closest node, but at least the matching in terms of digits will cease after about
log n steps.


So this hints that the message complexity of Pastry, the number of messages we need to send
to reach the node that owns the key (the one with the numerically closest ID), is very close to
this number, and that Pastry is essentially a log n step protocol. Most of this is clear from the
analysis.


It basically says that if you send about log n messages, you will arrive at a node that has the
maximum amount of match with the key. From that node we need to find the node that is
numerically the closest. Consider both the node ID and the key ID: given that we are proceeding
from the MSB direction and matching the top digits, we are already getting closer numerically.


But once we stop at one point, it is possible that we have a set of nodes that have the same 
amount of match in these digits, out of them we need to choose the node whose rest of the 


digits make it numerically the closest. 
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Neighborhood Set and Leaf Set 


Neighborhood Set (M)


@ It contains M nodes that are closest to the node according
to a proximity metric.


@ Typically contains $2^{b+1}$ entries.


Leaf Set (L)


@ Contains L/2 nodes with numerically closest larger
nodeIds.


@ Contains L/2 nodes with numerically closest smaller
nodeIds.
@ Typically $2^b$.


So this is where we will use two other important data structures, the neighborhood set and the
leaf set. The neighborhood set contains M nodes that are closest to the node according to, let
us say, geographical proximity. It will typically contain $2^{b+1}$ entries, in this case 32
entries. So the neighborhood set can be thought of as a geographical neighborhood set.


So this can contain 32 entries. Furthermore, what we can do is we can have a leaf set where we 
typically have 16 leaves, so the leaf set for any node will be L/2 nodes on the ring with 
numerically closest larger node IDs and numerically closest smaller node IDs L/2 on this side, 
L/2 on the other side. So that is the leaf set. So the leaf set captures the proximity in terms of 


node IDs. 


So the leaf set is a proximity in the ID space, and the neighborhood set is a proximity in the
geographical space, in the sense that if I have a node in my data center, other nodes in my data
center are a part of my neighborhood, but they are not necessarily a part of my leaf set. So the
neighborhood set captures network proximity or geographical proximity.
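
As a rough illustration of what a leaf set looks like, the following Python sketch picks the L/2 numerically closest smaller and larger IDs around a node on the ring. It assumes a global list of all node IDs is available, which a real node would not have; the function name `leaf_set` and the small integer IDs are made up for the example.

```python
def leaf_set(node_id: int, all_ids: list[int], size: int = 16):
    """Return (smaller, larger): the size/2 nearest IDs on each side of node_id.

    IDs are placed on a ring, so the neighbors of the smallest ID wrap
    around to the largest ones and vice versa.
    """
    ring = sorted(set(all_ids) | {node_id})
    idx = ring.index(node_id)
    # Do not take more than half of the other nodes on each side.
    half = min(size // 2, (len(ring) - 1) // 2)
    larger = [ring[(idx + k) % len(ring)] for k in range(1, half + 1)]
    smaller = [ring[(idx - k) % len(ring)] for k in range(1, half + 1)]
    return smaller, larger


if __name__ == "__main__":
    ids = [10, 20, 30, 40, 50, 60, 70]
    print(leaf_set(40, ids, size=4))
    print(leaf_set(10, ids, size=4))  # wraps around the ring
```

The neighborhood set would be built the same way, but sorted by a network proximity metric instead of by ID.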
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Routing Algorithm 


Algorithm 1: Routing algorithm
Input: key K, routing table R, hash of the key D
Output: value V

if $L_{-L/2} \le D \le L_{L/2}$ then
    forward K to $L_i$ such that $|L_i - K|$ is minimal
end
else
    $l \leftarrow$ common_prefix(K, nodeId)
    if $R(l + 1, D_{l+1}) \ne$ null then
        forward to $R(l + 1, D_{l+1})$
    end
    else
        forward to $T \in (L \cup R \cup M)$ such that
            prefix(T, K) $\ge l$ and
            $|T - K| < |\text{nodeId} - K|$
    end
end


So keeping these in mind, let us proceed to the routing algorithm. Assume that we have a key K
and a routing table R, and let the hash of the key be equal to D. These are the key important
variables, which I am encircling, and let the value that I am after be V. This is the routing
algorithm that every node will follow. The first thing that you do is check your leaf set, to see
if the key lies between the first and the last leaf.


Let the first leaf be $L_{-L/2}$ and the last leaf be $L_{L/2}$, which essentially means that we
have L/2 leaves on this side and L/2 leaves on the other side. So I am at a given node, and if the
given node's node ID is this, I will check whether the key falls anywhere in this leaf set.


We always assume that the leaf set information is accurate. So if the key falls anywhere in this
leaf set, then we know for sure which node is numerically the closest to it, because our view of
this region is absolutely correct. So we can forward the key K to the leaf $L_i$ such that the
distance $|L_i - K|$ is minimal; this is the closest node ID.


To explain this in a different way: if I am close enough to the actual node, I will come to a node
such that the key lies within the bounds of its leaf set, and within that leaf set I will always
find the node that is the closest to the key. This is an important concept, and I would suggest
that readers go through it several times.


So now, let us look at the more common case where the key is not within the range of the leaf
set. In this case, let me find the length of the common prefix between the key K and the node ID,
which is the number of matching digits. This is not hard to find at all. Let it be equal to l; I
would like to increase l to l + 1.


So I go to the routing table, look at row l + 1, and look at the next digit of the key, which is
the digit I would like to match. If the entry in row l + 1 for that digit is not null, it means that
I know of some other node where that digit also matches. Then I forward the message to that
node, $R(l + 1, D_{l+1})$, and the number of matching digits increases by 1.


If I do not find such an entry in the routing table, then I look at all the nodes that I know: all
the nodes for which I have information within my leaf set, my routing table, and my
neighborhood set. Out of all these nodes I choose a node T such that the length of the shared
prefix is, of course, at least l, and the distance $|T - K|$ is less than $|\text{nodeId} - K|$.


So I forward it to any node that reduces the distance to the key. I can always do better, in the
sense that I can forward it to the node that minimizes the distance instead of any node where
the distance is just lower than my current distance, but it actually does not matter. As long as
the distance is decreasing, I am good.
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So what is the key idea here? Well, the key idea over here is that I take a look at all the nodes, 
for which I have information. So that means I will scan all my data structures and of course, 
the length of the shared prefix should be the same if not more. So I will not compromise on 
that, so the shared prefix, the length of that has to be maintained. So I will never compromise 


on that and then essentially I will reduce the distance between the node ID. 


In the sense it is node T and the key K. So, this process is, of course, guaranteed to converge 
because what will happen is that if this is the current node ID, let us say on the ring, and this is 
the bounds of my leaf set, even if my routing table is not populated and the key is far away, at 


least what I will do is that I will find some node that is closer to the key. 


If I am not able to find I can always choose the edge of my leaf set because that is guaranteed 
to be closer to the key than the current node n. So, the current node n, of course, has some 
distance with the key, but I can always go closer by essentially looking at the boundary of my 
leaf set, the node which is at the boundary of my leaf set and forwarding the message, passing 


on the message to that node. So that is guaranteed to be closer. 


So what you see is that in every step we either increase the match by 1 or, if we are not able to
increase the match, we at least reduce the distance. We are guaranteed to be able to do that
because we have a leaf set, so we can always reduce the distance until the key starts falling
within the leaf set of a node. Once the key falls within a leaf set, we know for sure which node
owns the key, because our view of the leaf set is perfect, and then the message can be passed
on to that node.




(Refer Slide Time: 61:42) 


Explanation 


@ The node first checks to find if the key is within the leaf 
set. If so, it forwards the messages to the closest node (by 
nodeld) in the leaf set. 

@ Otherwise, Pastry forwards the message to a node with one 
more matching digit in the common prefix. 

@ In the rare case, when we are not able to find a node that
matches the first two criteria, we forward the request to any
node that is closer to the key than the current nodeId. Note
that it still needs to have a match of at least $l$ digits.


So the routing algorithm is simple. The node first checks whether the key is within the leaf set.
If so, it forwards the message to the closest node. Otherwise, Pastry forwards the message to
a node with one more matching digit. If neither criterion is met, then at least we do not
sacrifice the prefix length, but we send the message to another node that is numerically closer.
Because of the leaf set, we are always guaranteed to find such a node.
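
The three cases can be put together in a short Python sketch of the per-node forwarding decision. This is a simplification under stated assumptions, not the exact algorithm from the slide: IDs are short hex strings, distance is the plain absolute difference (ring wrap-around is ignored), the routing table is a hypothetical dictionary keyed by (shared prefix length, next digit), and the current node is included as a candidate in the leaf-set case so that it can decide it is itself the closest.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading hex digits that two IDs have in common."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def dist(a: str, b: str) -> int:
    """Numeric distance between two hex IDs (no ring wrap-around in this sketch)."""
    return abs(int(a, 16) - int(b, 16))


def next_hop(node_id, key, leaf_set, routing_table, neighborhood):
    """Return the ID to forward to, or node_id if no better node is known."""
    # Case 1: the key lies within the range covered by the leaf set.
    if leaf_set:
        values = [int(x, 16) for x in leaf_set]
        if min(values) <= int(key, 16) <= max(values):
            return min(list(leaf_set) + [node_id], key=lambda t: dist(t, key))
    # Case 2: the routing table knows a node matching one more digit.
    l = shared_prefix_len(node_id, key)
    if l < len(key):
        entry = routing_table.get((l, key[l]))
        if entry is not None:
            return entry
    # Case 3: any known node with at least the same prefix match that is closer.
    known = set(leaf_set) | set(routing_table.values()) | set(neighborhood)
    better = [t for t in known
              if shared_prefix_len(t, key) >= l and dist(t, key) < dist(node_id, key)]
    return min(better, key=lambda t: dist(t, key)) if better else node_id


if __name__ == "__main__":
    leaves = ["a100", "a1a0", "a1c0", "a200"]
    print(next_hop("a1b2", "a1bf", leaves, {}, []))                  # leaf-set case
    print(next_hop("a1b2", "fe36", leaves, {(0, "f"): "f000"}, []))  # routing-table case
    print(next_hop("a1b2", "fe36", leaves, {}, ["c000"]))            # fallback case
```

Case 3 picks the closest qualifying node, although, as noted in the lecture, any node that reduces the distance would be enough for progress.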


(Refer Slide Time: 62:17) 


Performance and Reliability 


@ If L/2 nodes in the leaf set are alive then the message can
always be passed on to some other node.


@ At every step, we are (with high probability) either searching
in the leaf set or moving to another node.


@ If the key is within the range of the leaf set, it is at most
one hop away.


@ Otherwise, at every step we increase the length of the matched
prefix by one base-$2^b$ digit.


Routing Time Complexity


The average routing time is thus $O(\log_{2^b}(N))$.


So, on performance and reliability: if this is the current node, we have L/2 nodes of the leaf
set on either side. If even one among them is alive, then the message can at least be passed on
to that node, and it is guaranteed to be closer to the key if you are traversing in this direction.
Again, at that node, if one among the L/2 nodes on either side is alive, then we can pass on the
message to that.


So we always have forward progress. Essentially, what this protocol ensures is that as long as
at least one of the L/2 nodes on either side is alive, we are guaranteed to reach the
destination. In that sense there is some amount of built-in reliability. And given that the nodes
in the leaf set are proximate in terms of IDs and not in terms of geographical distance, we can
have these nodes physically located in different data centers, such that all of them will not
suffer from an outage at the same time.


So, as we said, what do we do? We either increase the prefix length or bring the node closer to
the key in every step. The average routing time can be proven with more rigorous methods, but
the prefix length matching will definitely take about log n steps.


And even after that, when we are traversing the leaf sets, it can be shown experimentally that
the average routing time is O(log n), and if you want to be slightly more precise, that is
$O(\log_{2^b}(N))$.


(Refer Slide Time: 64:29) 


Arrival, Departure, and Locality


Outline 


@ Pastry 


@ Arrival, Departure, and Locality 


So let us now look at the arrival, departure and locality aspects of nodes. One important point
is due here: when I am creating a large ring of nodes, they are arranged by their node IDs and
not by any other metric. So it is highly likely that if one node is over here, the nodes that are
beside it on the ring are actually in other physical locations.


So I would typically create a DHT out of many data centers or many server farms, and the server
farms would be located all over the world. Consider Amazon, Facebook and so on: they have lots
and lots of server farms located all over the world, all connected together in a complicated
snake-like DHT. Of course, I am drawing a simplified figure; it is way more complicated if I want
to do it more realistically.


You might have one connection that goes here, then there, then comes back here, then goes
there again; it does not matter, it is still logically arranged as a ring. The reason for this is
that we want at least one among the L/2 nodes on either side of every node to be alive, and one
among them will be alive if we spread all of them out geographically.


So a logical ring allows us to do that. Another beautiful property of a DHT is that if you want 
to add new servers all that we do is we just expand the overlay and we just add the servers, so 
servers will get added at different points, so this can automatically expand to cater to higher 
demand. Like e-commerce sites during Christmas, so automatically what will happen is that 


your DHT can expand and you can add more nodes seamlessly. 


And you can delete nodes the same way; deletion is just the opposite of addition. So this
provides the flexibility that many of these e-commerce sites and movie providers actually
require, because they can add servers and they can delete servers.
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Node Arrival 


@ Assume that node X wants to join the network.

@ It locates another nearby node A.

@ A can also be found with expanding ring multicast.

@ A forwards the request to the numerically closest node = Z. 


@ Nodes, A, Z, and all the nodes in the path from A to Z send 
their routing tables to X. 


@ X uses all of this information to initialize its tables. 


So the addition protocol is simple. Assume that a new node X wants to join the network. It
locates a node A that is close to it in terms of geographical proximity; it can be another node in
the same data center or in the same subnet. Basically, it is another node in the same area,
which is close by and not on the other side of the world.


If A cannot be found directly, we can do an expanding ring multicast, which you would have
already seen in the gossip protocol. Node X sends a message to its 1-hop neighbors asking them
if they are a part of the DHT. If not, it sends a message to its 2-hop neighbors asking them if
they are a part of the DHT; if not, it sends a message to its 3-hop neighbors, and so on.


This can easily be done with the TTL field of the TCP/IP protocol suite, the IP protocol
specifically, where we specify that the message will be alive for only two hops or three hops.
With this it is possible to find the node closest to you that is a part of the DHT. Many DHTs also
have centralized directories that maintain a list of nodes that are close to a node that wants to
join.
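
The expanding ring idea can be simulated in Python on a small graph. This is only a sketch of the search pattern: a real implementation would send multicast packets with increasing IP TTL values, whereas here the network is a hypothetical adjacency dictionary and the TTL is modeled as a hop limit on a breadth-first search. All names are illustrative.

```python
from collections import deque


def within_hops(graph, start, ttl):
    """Nodes reachable from start in at most ttl hops (start itself excluded)."""
    seen = {start}
    frontier = deque([(start, 0)])
    reached = []
    while frontier:
        node, depth = frontier.popleft()
        if depth == ttl:
            continue  # the message "expires" here and is not forwarded further
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                reached.append(neighbor)
                frontier.append((neighbor, depth + 1))
    return reached


def expanding_ring_search(graph, start, is_member, max_ttl=8):
    """Try TTL = 1, 2, 3, ... until some DHT member answers.

    Returns (member, ttl) for the first member found, or None.
    """
    for ttl in range(1, max_ttl + 1):
        hits = [n for n in within_hops(graph, start, ttl) if is_member(n)]
        if hits:
            return hits[0], ttl
    return None


if __name__ == "__main__":
    network = {"X": ["p", "q"], "p": ["r"], "q": [], "r": ["A"]}
    dht_members = {"A"}
    print(expanding_ring_search(network, "X", lambda n: n in dht_members))
```

Because the hop limit grows one step at a time, the first member found is also one of the closest in terms of network hops.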


Then A forwards the request to the node that is numerically closest to X; let this node be Z.
So what is the idea? Well, I have a ring, and node X wants to join. It contacts node A, which is
the closest to it in terms of geographical proximity. Node A then finds node Z, which is the
closest to node X in terms of proximity of node IDs.

Then nodes A, Z, and all the nodes in the path from A to Z cooperate and update their routing
tables to incorporate the information that node X has joined. So what is node A? Well, node A is
the closest in terms of geographical proximity; it can also be any node, but geographical
proximity is preferred, and we will see why. Node Z is the one that is closest in terms of node
ID proximity.


Locating node Z is node A's job. Once node Z has been located, all the nodes on the path from A
to Z collaborate and update their routing tables to record the information that X is now in the
system. It is important to remember this: X is the node that is joining. It first contacts A, which
is the closest geographically. A then contacts Z, which is the closest in terms of node IDs, and
all the nodes on the path from A to Z, including A and Z, collaborate in this job of adding X to
the system.
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Table Initialization 


Neighborhood Set
Since A is close to X, X copies A's neighborhood set.


Z is the closest to X (nodeId). X uses Z's leaf set to form its own
leaf set.


So how do we initialize the tables? Well, since A is close to X geographically, X simply copies
A's neighborhood set, because A's neighborhood will be X's neighborhood: any node that is
close to A will by definition be close to X. Similarly, Z is the closest to X in terms of node IDs,
so X will use Z's leaf set to form its own leaf set.


Of course, it will slightly adjust it, depending upon whether X is to the left or right of Z. Say Z
is over here and X is over here; then X will form its leaf set on the left by taking data from Z's
leaf set, and it will also form its leaf set on the right by again taking data from Z's leaf set and
making a few minor additions.

For example, the last node of X's leaf set on one side may not be there in Z's leaf set, so X
might have to contact a few more nodes to get more information. It does not matter exactly how
this is done; the important point to keep in mind is that the neighborhood set will come from A
and the leaf set will come from Z. There is one important point that I slightly missed, so let me
look at it again.


So how is Z located in the first place? Well, that is not hard at all. X has a node ID, and this
node ID is given to A. A treats X's node ID as a key and looks it up in the DHT. The lookup finds
the node that is the closest to that key, which will be Z, and since the key was X's node ID, Z
will be the closest to X.


So this, as you can see, is just the key lookup operation; in this case it is not a key but X's
node ID. That is how we get Z. The nice thing is that the neighborhood set will come from A and
the leaf set will come from Z, but as I said, some minor adjustments need to be made because
the leaf sets are not exactly equal. There is a huge overlap, but they are not the same; that
needs to be kept in mind.
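
A minimal Python sketch of this initialization step follows. It assumes Z has already been found by looking up X's node ID, uses small integer IDs, and ignores ring wrap-around; the `Node` class, the `init_tables` function, and the choice to add A itself to X's neighborhood set are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    neighborhood: list = field(default_factory=list)
    leaf_set: list = field(default_factory=list)


def init_tables(x: Node, a: Node, z: Node, half: int = 2) -> None:
    """Initialize X's neighborhood set from A and its leaf set from Z."""
    # A is geographically close to X, so A and A's neighbors are X's neighbors.
    # (Including A itself is an assumption of this sketch.)
    x.neighborhood = [a.node_id] + [n for n in a.neighborhood if n != x.node_id]

    # Z is closest to X by ID, so Z and Z's leaves are candidates for X's leaf set.
    candidates = set(z.leaf_set) | {z.node_id}
    candidates.discard(x.node_id)
    smaller = sorted((i for i in candidates if i < x.node_id), reverse=True)[:half]
    larger = sorted(i for i in candidates if i > x.node_id)[:half]
    # If either side has fewer than `half` entries, X would need to contact
    # more nodes; that adjustment is not modeled here.
    x.leaf_set = sorted(smaller + larger)


if __name__ == "__main__":
    a = Node(900, neighborhood=[901, 905])
    z = Node(50, leaf_set=[30, 40, 60, 70])
    x = Node(55)
    init_tables(x, a, z, half=2)
    print(x.neighborhood, x.leaf_set)
```

In the example, X's leaf set overlaps heavily with Z's but is not identical to it, which is the adjustment the lecture refers to.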
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Table Initialization - || 


Routing Table 


@ Assume that A and X do not share any digits in the prefix
(General Case).

@ The first row of the routing table is independent of the
nodeId (No Match). X can use A's row to initialize its first row.

@ Every node in the path from A to Z has one additional digit
matching with X. Let $B_i$ be the $i^{th}$ entry in the path from A
to Z.

@ Observation: $B_i$ shares $i$ digits of its prefix with X. Use the
$(i + 1)^{th}$ row in its routing table to populate the $(i + 1)^{th}$ row
of the routing table of X.

@ Finally X transmits its state to all the nodes in $L \cup R \cup M$.


So let us now look at the most important question: how do we initialize the routing table?
If the first i digits of A and X match, well, nothing better: we just copy the first i rows from
the routing table of A to that of X. So let us assume the more general, more difficult case,
where A and X do not share any digits in the prefix.
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So that is the more difficult one and if it is all that, the other ones automatically follow. So this 
means that the first row of the routing table is independent of the node ID. There is no match
in terms of digits, but that is okay: the first row of any routing table describes exactly the
case where no digit matches. So let us say that the first digit of A's node ID is F, and the
first digit of X's node ID is E.


So the first row of A will contain pointers to nodes whose first digit is from 0 to E. If it does
not have a pointer to a node whose first digit is E, then we now have such a node, namely X,
so X can be added to A's routing table; that repair can be done right here. For the first row of
the routing table of X, we discard the E entry and take the pointers to the nodes whose first
digits are 0 to D; E is of course not required, since it is X's own digit.


And the node whose first digit is F is actually node A itself, which can be added to the routing
table of X. So X takes the first row of the routing table of A, makes the minor modification
that was just discussed, and populates its own first row. Then every node on the path from A to
Z will have one additional digit matching with X. So let Bi be the i'th entry on the path from
A to Z.


So Bi shares i digits of its prefix with X. What we can do is use the (i + 1)th row of the
routing table of Bi to populate the (i + 1)th row of the routing table of X. Recall what the
(i + 1)th row means: the first i hex digits of the prefix match the key, or whatever we are
trying to look up, and the row has one entry for every value of the (i + 1)th digit other than
the node's own. So we have 15 such choices.


So if there is a match for the first i digits, we just access the (i + 1)th row of the routing
table of Bi, copy it to X, and make small modifications, as was discussed in the case of A and
X. This process continues, so X essentially uses the information from the nodes on the path
from A to Z. I will explain this graphically as well.


So if we look at it, node A was contacted first; node A took X's node ID and looked up node Z.
The message was sent to Z via a set of nodes; let us call them B1, B2, B3, and so on. For any
Bi on this path, the first i digits of its prefix match those of X. This is the precise meaning:
the first i digits match.


So then we look at Bi and take the (i + 1)th row of its routing table. This row is meant to
contain pointers to other nodes whose (i + 1)th digit is one among these 15 columns. What we
can do is transfer this row to the routing table of node X, and the transfer is guaranteed to be
correct. The reason is that the first i digits match, so the (i + 1)th row contains the pointers
for one additional matching digit.


And hence, this information is the same for Bi and X. But of course, since the (i + 1)th digits
of Bi and X do not match, one or two of these entries will not be relevant; those small
corrections can be made. So as we proceed from A to Z, we get all the relevant rows from the
routing tables, and all of them can be added to the routing table of X. It will now have a
reasonably well populated routing table.
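The row-by-row construction described above can be sketched as follows. This is a simplified illustration, not the paper's implementation: node IDs are short hex strings, a routing table is a dictionary mapping a 0-based row number to a dictionary from digit to node ID, and the i'th node on the path (with A at position 0) is assumed to share exactly i leading digits with X. All names and sample IDs are hypothetical.

```python
def init_routing_table(x_id, path):
    """Build X's routing table from the nodes on the route A -> Z.

    path[i] is (node_id, routing_table) for the i'th node on the route, assumed
    to share i leading digits with x_id. Row i here is the lecture's (i + 1)th row.
    """
    table = {}
    for i, (node_id, node_table) in enumerate(path):
        if node_id[:i] != x_id[:i]:
            raise ValueError(f"{node_id} does not share {i} digits with {x_id}")
        row = dict(node_table.get(i, {}))   # copy the donor's row
        row.pop(x_id[i], None)              # X needs no entry for its own digit
        if node_id[i] != x_id[i]:
            row[node_id[i]] = node_id       # the donor itself fits into this row
        table[i] = row
    return table


x_id = "E5D1"
path = [
    ("F002", {0: {"1": "10A3", "6": "6F21", "E": "E9C7"}}),  # A: no shared digits
    ("E9C7", {1: {"4": "E4B0", "5": "E590"}}),               # B1: shares "E"
    ("E590", {2: {"4": "E54A"}}),                            # B2 = Z: shares "E5"
]
x_table = init_routing_table(x_id, path)
for row_number, row in x_table.items():
    print(row_number, row)
```

In the first row, the E entry copied from A is dropped and A itself is added under F, which is the "minor modification" mentioned in the lecture; the later rows are handled the same way.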


So now we come to the trick question: why do we need a neighborhood set? Why did we need to
choose a node A that is in the proximity of X, and why did we then go from A to Z? What is the
advantage? Well, to see the advantage we actually need to start from a disadvantage.
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So one disadvantage of DHTs is like this that let us say that I have a key and the node that owns 
the key is close by, let us say in the same data center. What would happen in a DHT, since we
proceed by node IDs and not by actual physical locations? Well, the message is going to jump,
jump, jump and ultimately come to that node, but this process of sending messages might take
it all over the world.


It might take me halfway around the world, in the sense that I might issue a request in London
and the next server might be in Tokyo, the next one in New York, the next one in Stockholm,
and the final node in Sydney. So the network latency in this case will be high. I still want the
advantages of reliability, but I would also like the messages to stay within proximate
neighborhoods so that they travel fast.


So because of that we did a small trick: we start from node A, which is close to X in terms of
geographical proximity. Furthermore, given the way that we have populated these tables, we can
assume by induction that when A was added, it contacted some other node that is close by, and
when that node was added, it in turn contacted some other node that is close by.


So pretty much all the entries of the routing tables point to nodes that are in the proximate
neighborhood. The entries of the routing tables fundamentally capture a notion of geographical
locality, and this is ensured by the way that we construct the DHT. Why? Well, because when
any node joins, it always starts by contacting a node that is close by.


And that node will then look at its routing table, but its routing table would anyway consist of
nodes that are close by. Typically, in the node space we will have many nodes that have
matching prefix digits, but out of those we should choose the one that is close by to reduce
latency, and this, as you can see, is ensured by construction: we always contact the node that
is close by.


It will always route the message among its close contacts, and then those nodes will route the
message among their close contacts. So with this itself we see that all the nodes on the path
are expected to be geographically close to X; the path will not be totally random. This
means that we are not going to crisscross the world while sending a message from A to
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Z. So that is an important point to keep in mind that the neighborhood is handy in this case. It 


is a performance enhancing technique. 
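The remark that, among many nodes with a matching prefix, we should keep the one that is close by can be sketched as below. The proximity metric is assumed here to be Euclidean distance between hypothetical planar coordinates; in a deployed system it could be measured latency instead. The function name and the sample data are illustrative only.

```python
import math


def nearest_candidate(candidates, coords, me):
    """Among nodes that fit the same routing-table cell, pick the one nearest to `me`."""
    return min(candidates, key=lambda n: math.dist(coords[n], coords[me]))


# Hypothetical coordinates; all three candidates share the required prefix "E4".
coords = {"X": (0.0, 0.0), "E4B0": (1.0, 2.0), "E47C": (40.0, 9.0), "E401": (7.0, 7.0)}
best = nearest_candidate(["E4B0", "E47C", "E401"], coords, "X")
print(best)
```

Any of the three candidates would be a correct routing entry; choosing the nearest one only affects latency, which is why the lecture calls locality a performance-enhancing technique.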
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Table Initialization - || 


Routing Table


• Assume that A and X do not share any digits in the prefix (general case).

• The first row of the routing table is independent of the nodeId (no match). X can use A's
  row to initialize its first row.

• Every node in the path from A to Z has one additional digit matching with X. Let Bi be the
  i'th entry in the path from A to Z.

• Observation: Bi shares i digits of its prefix with X. Use the (i + 1)th row of its routing
  table to populate the (i + 1)th row of the routing table of X.

• Finally, X transmits its state to all the nodes in L ∪ R ∪ M.


So let us now come to the last point of this slide: once X has been added to the DHT, it
transmits its state. It makes itself known to everybody in its leaf set, all the nodes in its
routing table, and all the nodes in its neighborhood set, so that they can also update their
internal state to take X into account. As we have mentioned, the leaf set particularly needs to
be very accurate, so the nodes in the leaf set have to update themselves.


The rest of the nodes, in the routing table and the neighborhood set, can do it gradually and
lazily, but all of them essentially have to record the fact that X is in the system and has been
fully added.
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Node Departure — Leaf Set 


• Nodes might just fail or leave without notifying.

Repairing the leaf set

• Assume that a leaf L_k fails (-L/2 < k < 0).

• In this case, the node contacts L_{-L/2}.

• It gets its leaf set and merges it with its own leaf set.

• For any new nodes added, it verifies their existence by pinging them.


So, node departure. Nodes might leave with or without notifying. If a node is leaving
voluntarily, it might let other nodes know; otherwise, it is possible that it just fails. If it
fails, it will just fail; it will not let anybody know. So assume that the leaf L_k fails, where
k is negative, which means that it is on one side of the node, let us say the anticlockwise side.


So a node can send periodic heartbeat messages to its leaves, and once it becomes aware that a
leaf has actually failed, it needs to reconstruct its leaf set. It can contact the node at the
end of its leaf set on that side, get that node's leaf set, and then reconstruct its own leaf
set accordingly.


So it will essentially merge the two leaf sets and reconstruct. So what is the key idea? Every
node keeps track of its leaf set by sending heartbeat messages. Whenever a leaf dies and this
fact is discovered, other nodes in the leaf set are contacted, they send their leaf set
information to the node, and the nodes in the vicinity reconstruct their leaf sets.
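The leaf set repair just described can be sketched for one side of the leaf set. The sketch makes several simplifying assumptions: node IDs are plain integers with no wraparound, `leaf_sets` stands in for asking a remote node for its leaf set, and membership in the set `alive` stands in for a ping. None of these names come from the Pastry paper.

```python
def repair_smaller_side(own_id, smaller_leaves, failed_id, leaf_sets, alive, half):
    """Rebuild the half of the leaf set that holds IDs smaller than own_id."""
    survivors = [n for n in smaller_leaves if n != failed_id]
    if not survivors:
        return []
    farthest = min(survivors)                      # node at the end of this side
    candidates = set(survivors) | set(leaf_sets[farthest])
    # Keep only nodes on the correct side that answer a ping (here: are in `alive`).
    verified = [n for n in candidates if n < own_id and n in alive]
    return sorted(verified, reverse=True)[:half]   # closest IDs first


own_id = 100
smaller_leaves = [90, 80, 70, 60]                  # node 80 has failed
leaf_sets = {60: [30, 40, 50, 55, 70, 80, 90, 100]}
alive = {30, 40, 50, 55, 60, 70, 90, 100}
repaired = repair_smaller_side(own_id, smaller_leaves, 80, leaf_sets, alive, half=4)
print(repaired)
```

The stale entry 80 in the remote leaf set is filtered out by the liveness check, which is the role that pinging plays on the slide.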
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and Locality 


Node Departure — Routing Table and Neighborhood 


Repairing the Routing Table
• Assume that R(l, d) fails.
• Try to get a replacement for it by contacting R(l, d') (d ≠ d').
• If it is not able to find a candidate, it casts a wider net by asking R(l + 1, d') (d ≠ d').


Repairing the Neighborhood Set (M)

• A node periodically pings its neighbors.
• If a neighbor is found to be dead, it gets M from its neighbors and repairs its state.


Repairing the routing table: if a node leaves, let us assume that an entry in row l fails, and
let us try to get a replacement for it. So let us contact some other entry in the same row. It
is guaranteed that the number of matching prefix digits is the same for all entries in a row, so
let us contact another entry.


If that other entry knows of some other node that satisfies this criterion, then fine: it will
send that node, and the routing table will update itself. But if we are not able to... hold on;
before saying that, I would like to show this graphically. So let this be the l'th row, and let
us assume that the entry for digit d fails.


Then we pick the entry for another digit d' in the same row. This entry points to a server, and
we send that server a message and ask: do you know of any other node whose first l digits are
the same as those of your node ID, but whose (l + 1)th digit is d? If it has such an entry in
its table, which is not the same server that failed, then it sends that entry, and it is used to
update the failed entry over here.


If we are not able to find a candidate, well, then we cast a wider net. We start asking other
servers with a higher degree of prefix match, not a lower degree but a higher degree. The
message that is sent is essentially this: I had an entry at R(l, d), that server has failed, so
do you know of some other server whose first l digits are the same as those of your node ID
and whose (l + 1)th digit is d?
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So if those other servers have some record of some other node which has this property, they 
will send it and the routing table can be repaired. How do we repair the neighborhood set?
Well, a node periodically pings its neighbors. If a neighbor is found to be dead, we get the
neighborhood set M from the other neighbors and try to repair the state, similar to a leaf set
repair.
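The routing table repair described above (ask the other entries of the same row first, then cast a wider net over later rows) can be sketched as follows. The table layout (0-based row number mapping to a dictionary from digit to node ID), the `remote_tables` dictionary that stands in for asking a remote node, and the set `alive` that stands in for a liveness check are all assumptions made for this illustration.

```python
def repair_entry(table, row, digit, failed_id, remote_tables, alive):
    """Find a replacement for the failed entry table[row][digit]."""
    for r in sorted(k for k in table if k >= row):   # same row first, then deeper rows
        for d, peer in table[r].items():
            if (r == row and d == digit) or peer == failed_id or peer not in alive:
                continue
            # Ask the peer for its own entry at the same (row, digit) position.
            candidate = remote_tables.get(peer, {}).get(row, {}).get(digit)
            if candidate is not None and candidate != failed_id and candidate in alive:
                table[row][digit] = candidate
                return candidate
    return None


table = {0: {"1": "10A3", "6": "6F21", "F": "F002"}, 1: {"4": "E4B0", "9": "E9C7"}}
remote_tables = {
    "6F21": {0: {"1": "10A3", "E": "E4B0"}},   # only knows the failed node
    "F002": {0: {"1": "1B77", "E": "E9C7"}},   # knows a live alternative
}
alive = {"6F21", "F002", "E4B0", "E9C7", "1B77"}
replacement = repair_entry(table, 0, "1", "10A3", remote_tables, alive)
print(replacement, table[0])
```

A peer in the same row, or in any later row, shares at least the first `row` digits with the repairing node, so its entry at the same position is a valid replacement.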
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Artival, Departure, ari Locality 


Maintaining Locality


• Assume that before node X is added, there is good locality.
• All the nodes in the routing table point to nearby nodes.
• We add a new node X.
• We start with a nearby node A and move towards Z (closest by nodeId).
• Let Bi be the i'th node in the path. (Induction assumption: Bi has locality.)
• Bi is fairly close to X because it is fairly close to A.
• Since we get the i'th row of the routing table from Bi, and Bi has locality, the i'th row of
  the routing table of X also has locality.
• Thus, X has locality.


Induction hypothesis proved


Maintaining locality: this comes back to the trick question that I had asked a few slides
earlier. Assume that before node X is added, there is a good amount of locality in the routing
tables and so on. This is geographical locality: all the nodes in the routing table point to
nearby nodes. So when we add a new node X, we start with a nearby node A and move towards Z,
the node closest to X by node ID.


So let Bi be the i'th node on the path. The induction assumption is that Bi has locality. Bi
will be fairly close to X because it is fairly close to A, and A is close to X. Using similar
logic, we can show that all the nodes on the path from A to Z will be reasonably close to X, so
this geographical proximity is maintained. This essentially means that if X were to search for
some key, the request would always pass via nodes that are close to it, so the latency of
transmission is lower.
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Pastry Arrival, Departure, and Locality 


Tolerating Byzantine Failures 


@ Have multiple entries in each cell of the routing table. 
@ Randomize the routing strategy. 


@ Periodically send IP broadcasts and multicasts (expanding 
ring) to connect disconnected networks. 


Can we tolerate more failures, particularly random failures, Byzantine failures where nodes are 
deliberately malicious? Well, yes. Have multiple entries in each cell of the routing table. So in 
the routing table instead of having one entry, have multiple entries. We randomize the routing 
strategy in the sense if we have multiple servers. For some keys we send it to server S1, for 


some to S2, for some to S3. 


If S1 is, let us say, a malicious node, at least the rest of the messages can get through. We
periodically try to discover parts of the network by sending IP broadcast and multicast
messages, so that if, let us say, nodes in the leaf set, the neighborhood set, or the routing
table have died, we can reconnect the network. So DHTs are in continuous repair mode: like our
immune systems, they are continuously checking by pinging other nodes and repairing
themselves.
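The effect of keeping several entries per routing-table cell and randomizing the choice can be illustrated with a small simulation. This is only a sketch: the cell contents S1, S2, S3 follow the lecture's example, while the function name, the seed, and the number of messages are arbitrary assumptions.

```python
import random


def choose_next_hop(cell, rng):
    """Pick one of the alternative servers stored in a routing-table cell at random."""
    return rng.choice(cell)


cell = ["S1", "S2", "S3"]          # S1 is assumed to be malicious
rng = random.Random(7)             # fixed seed so that runs are repeatable
hops = [choose_next_hop(cell, rng) for _ in range(3000)]
through_s1 = hops.count("S1")
print(f"fraction routed via S1: {through_s1 / len(hops):.2f}")
```

With three equally likely choices, roughly one third of the messages still pass through the malicious node, but the remaining messages avoid it, which is the point made in the lecture.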
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Outline 


@ Pastry 


® Results 


A quick look at the results: the results are, of course, there in the main paper, but we can
still briefly look at some key parts of them.
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@ 100,000 nodes 


@ Each node runs a Java based VM 
@ Each node is assigned a co-ordinate in an Euclidean plane. 


So this is a fairly old paper, and modern DHTs are much bigger, but this was a simulation that
had 100,000 nodes. Each node runs a Java based virtual machine, and it is assigned a coordinate
on the Euclidean (X, Y) plane.
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Performance Results 


• The average number of hops varies linearly with the number of nodes (in the log scale).
• 2.5 hops for 1000 nodes. 4 hops for 100,000 nodes.
• For 100,000 nodes and 200,000 lookups the probability distribution is as follows.
    • 2 hops - 1.5%
    • 3 hops - 16.4%
    • 4 hops - 64%
    • 5 hops - 17%
• With a complete routing table, the hop count would have been 30% lower.


Source [1] 


So the average number of hops varies linearly with the logarithm of the number of nodes; what
we see is that the hop count follows a log trend. This is clearly visible in the paper; I am
just summarizing the key results: for 1000 nodes we have roughly two and a half hops, and for
100,000 nodes 4 hops. So as the number of nodes scales, the hop count increases only
logarithmically, which is, of course, great news for us.
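As a quick sanity check on the log trend, the reported hop counts can be compared with the base-16 logarithm of the network size. The base 16 is an assumption taken from the hexadecimal digits used in this lecture; each hop resolves roughly one more hex digit of the key. The function name is hypothetical.

```python
import math


def expected_hops(num_nodes, base=16):
    """Approximate hop count if every hop fixes one more base-`base` digit."""
    return math.log(num_nodes, base)


for n in (1_000, 100_000):
    print(n, round(expected_hops(n), 2))
```

The two values come out at about 2.49 and 4.15, which is in line with the roughly 2.5 and 4 hops quoted from the paper.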


And for 100,000 nodes and 200,000 lookups, the probability distribution of the hops is like
this: one and a half percent of the lookups take 2 hops, and most of them take between 3 and 5
hops: 3 hops = 16.4%, 4 hops = 64%, and 5 hops = 17%. One important point that needs to be
kept in mind is that in a DHT, we never have a complete global view of the network; we
always have a rather local view.
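The hop distribution quoted above can be turned into an average with a few lines of arithmetic. Note that the four listed shares add up to 98.9%, not 100%; the sketch below simply normalizes over the listed values, which is an assumption on my part and not something the lecture states.

```python
# Share of lookups per hop count, as quoted for 100,000 nodes and 200,000 lookups.
hop_share = {2: 0.015, 3: 0.164, 4: 0.64, 5: 0.17}

covered = sum(hop_share.values())   # fraction of lookups accounted for by the list
mean_hops = sum(h * p for h, p in hop_share.items()) / covered
print(f"covered: {covered:.3f}, mean hops: {mean_hops:.2f}")
```

The resulting mean is just under 4 hops, consistent with the average reported for 100,000 nodes.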


X contacts node A, node A creates a path to node Z, and that is how X gets to know about at
least some part of the network. So the routing tables are by definition incomplete, because for
such a large network obtaining a global view is hard. Now let us say I somehow have a global
view and am able to create a complete routing table, which has all the information that is
there in the system.


Then, of course, my routing tables will be much bigger and the hop count will be lower, but not
substantially lower. That is an important point to keep in mind: the hop count will only be
30% lower. That is what the paper says, and this, if you look at it, is actually a fantastic
result: it shows that even with a limited amount of information, we are able to do quite well,
even though we may not have a global view of the network.
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So global view of the network is feasible only for small networks, but for large networks, we 
look at a small region and do our job. This does not give us global optimality, but we can say
that the result is not that bad. A complete routing table would reduce the average hop count by
only 30%, and paying that price is tolerable given the fact that we have a practical protocol.


(Refer Slide Time: 95:49) 


[1] Antony Rowstron and Peter Druschel. Pastry: Scalable, decentralized object location and
routing for large-scale peer-to-peer systems. Middleware 2001.


So the original Pastry paper was published by Antony Rowstron and Peter Druschel, in 
Middleware 2001. So I would request all viewers to read the paper. The paper has all the details 
of the experiments, the protocols, everything. And of course, if somebody really wants to learn 
Pastry, the best way is to implement the Pastry algorithm using a popular language like Java or 


Python, and actually see it running. 


That is the best way to learn this. Thank you very much for bearing with me in this long
lecture. Next we will discuss the Chord DHT, which is conceptually similar to Pastry, in the
sense that it has the same concepts, but it is architected very differently.
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3 Generation P2P Networks 


Freenet 


Smruti R. Sarangi 


Department of Computer Science 
Indian Institute of Technology 
New Delhi, India 


Welcome to the discussion on 3rd generation peer2peer networks. So we will discuss Freenet. 
So Freenet is also a precursor to the dark web. So we will look at this purely from an academic 
point of view, but there is, of course, a huge potential for such technologies to be misused and 
they are also being misused. But our discussion will only look at the academics part of it and 
we clearly do not support or endorse or promote or encourage any form of misuse. The 


discussion for this lecture is completely of an academic nature. 
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Outline 


• Overview
    • Queries
    • Data Storage


• Details of the Protocol
    • Message Format
    • Naming and Searching


• Evaluation
    • Setup
    • Results


So we will look at queries and the data storage protocol in Freenet, then the details of the
protocol, and then a few of the evaluation results published in the original papers.


Design Goals 


• Freenet is a 3rd generation peer2peer network.
• Features: Publication, replication, and retrieval of data.
• Main feature: It is a dark net = Not possible to find the true origin of a file.
• Wonder why Freenet has additional security!!!
• Design goals: Anonymity, Deniability of storers, Efficient storage, Reliability
• The storage is meant to be temporary (not necessarily permanent)


Available at: http://freenet.sourceforge.net


So Freenet is a 3rd generation peer2peer network. The main features are that different nodes
can publish data, whatever it is; unlike Gnutella and Napster, Freenet is meant to share
everything. There is replication: unlike other mechanisms, where the data is not replicated and
you only record where a piece of data is stored, in this case the data is replicated across the
nodes. And, of course, data is retrieved.
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So the main feature of Freenet is that it is a dark net, which means it is not possible to find the 
true origin of a file. So people will actually not know who is requesting for the file to a large 
extent and people will also not know who is the original provider of the file; so hence, it keeps 
both the provider as well as the requester of the file, it keeps both of them anonymous. This is 


very powerful. So there can be a lot of negative uses of this. 


But it is important for us to understand how exactly it works. The design goals of Freenet are
clearly anonymity; the deniability of storers, so that a machine can say, I was just
distributing the file, I was not creating it, I was not storing it; efficient storage; and, of
course, reliability. In this case the storage everywhere is meant to be temporary; it is not
meant to be permanent.


You need to understand that this protocol was designed very much with the legal point of view
in mind, so that it can be denied that a certain machine or a certain user actually created a
file. Deniability was a very important part of this protocol, particularly from a legal angle.
A large part of this protocol is available on the web; you can download it from
freenet.sourceforge.net. The last time we checked, the link was active, but it has been a while
now, so you can look it up.
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Q Overview 
® Queries 
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A Freenet Node 


• Each node maintains its local data store.
• Dynamic routing table: addresses of other nodes and the keys that they (might) hold.
• A node knows only about its immediate neighbours (not others).


Freenet Query
• Queries are sent to a node that can pass it to its neighbours.
• Each query has a TTL (time-to-live) field that is decremented at every hop.
• A query has a pseudo-random identifier. This ensures that there are no cycles introduced
  while forwarding queries.


So overview, let us look at the queries. So a Freenet node looks like this. So each node 
maintains a local data store. So along with the local data store there is a routing table, which 
has the addresses of other nodes and the keys that they might hold. So we have the notion of a 
key similar to the one that we had in Pastry. So let us say we take the name of the file, we hash 


it and we create the key. 


So the routing table is not the same as it was in Pastry or as it is in a DHT, but the idea is
similar: we have a set of keys and a set of nodes that may possibly hold the value associated
with each key, in this case the contents of the file. A node as such is not aware of the entire
network, because being aware of the entire network would be a security hazard. It is aware only
of its immediate neighbors, and there may be a degree of trust between the node and its
neighbors.


So a Freenet query is like this. It is similar to a Gnutella query, in that it is sent to a
node that can pass the query on to its neighbors. Each query has a TTL field, a time-to-live
field, that is decremented at every hop. This ensures that we do not flood the network with
messages. Furthermore, every query has a pseudo-random identifier, which ensures that there are
no cycles while forwarding queries; otherwise a forwarded query could go around in cycles.


So to ensure that there are no cycles, we have a pseudo-random identifier. So this basically 
means that a node does not forward the same query twice. If it sees that it has already forwarded 


this query it will not forward it once again. 
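The two safeguards just described, the TTL and the pseudo-random query identifier, can be sketched as a small per-node filter. The class and method names are hypothetical, and `uuid.uuid4()` is used only as a convenient stand-in for a pseudo-random identifier; the actual Freenet message format is not shown here.

```python
import uuid


class QueryFilter:
    """Per-node bookkeeping that decides whether a query may be forwarded."""

    def __init__(self):
        self.seen = set()   # identifiers of queries this node has already handled

    def admit(self, query_id, ttl):
        """Return the TTL to forward with, or None if the query must be dropped."""
        if ttl <= 0 or query_id in self.seen:
            return None     # expired, or a cycle brought the query back to us
        self.seen.add(query_id)
        return ttl - 1      # the TTL is decremented at every hop


query_id = uuid.uuid4().hex
node = QueryFilter()
first = node.admit(query_id, ttl=5)    # forwarded with TTL 4
second = node.admit(query_id, ttl=4)   # same query again: dropped
print(first, second)
```

The TTL bounds how far a query can travel, while the set of seen identifiers stops a node from forwarding the same query twice.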
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Steps in a Query 


• The node hashes the name of the file. This is the key.

• In its routing table, it looks up the key that is closest to the key, and passes the request
  to its owner.

• If a node finds the file, it returns the contents, along with its address (saying that it is
  the owner of the data).

• Otherwise, it finds the nearest key in its routing table, and forwards the request to that
  node.

• If the request is ultimately successful, then the nodes on the way will:
    • Cache the file
    • Create an entry in their routing tables, and record the original source


So the steps in a query are like this. The node hashes the name of the file to create the key.
In its routing table it looks up which key is closest to this key, and it passes the request to
that key's owner. This is like adding a little bit of Pastry to Gnutella: if you do not have the
key, at least you go to the owner of the closest key. If a node finds the file, it returns the
contents along with its address, and it always claims that it is the owner of the data even
though it may not be.
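The first two steps of a query, hashing the file name and picking the routing-table entry with the closest key, can be sketched as below. SHA-1 and plain numerical distance between keys are assumptions made for this illustration, and the routing table is reduced to a dictionary from key to neighbor name; `file_key`, `next_hop`, and the sample names are hypothetical.

```python
import hashlib


def file_key(name):
    """Hash a file name into an integer key (SHA-1 is an assumption here)."""
    return int.from_bytes(hashlib.sha1(name.encode("utf-8")).digest(), "big")


def next_hop(routes, key):
    """Return the neighbor recorded for the known key that is closest to `key`."""
    closest = min(routes, key=lambda k: abs(k - key))
    return routes[closest]


key = file_key("report.pdf")
routes = {key - 5: "node-b", key + 1000: "node-c"}   # hypothetical routing table
print(next_hop(routes, key))
```

The node does not need an exact match in its routing table; it forwards the request to whichever neighbor is associated with the nearest known key.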


As we will see later, this is a way of essentially adding more noise, more entropy to the
system: whenever a node forwards the file, it says that it is the owner of the data, or this is
implied. So this is essentially a way of not leaving a legal trail. Otherwise, as we discussed,
the node finds the nearest key in its routing table and forwards the request to that node. Given
that there is a TTL field, a time-to-live field, the messages will not percolate in the network
forever.


But if the request is ultimately successful, then, similar to Gnutella, the message comes back
hop by hop. Each node on the way caches the file, creates an entry in its routing table, and
records the original source, whoever provided the file. But this is not the original source in
the sense of the node that originally owned the file; it is one of the holders, which is faking
it and saying, look, I am the owner.


It is essentially the node that is providing the file, but it may not actually be the original
owner. It is the source, but the original owner can be somewhere else. Maybe some other query
happened earlier and that is how this node got the file, but as far as we are concerned, the
nodes on
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the path, on this path will record that look this is the key and this node over here had actually 


supplied it. 
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Steps in a Query - II 


• If a node cannot forward the request to another node:
    • Creates a cycle
    • Failure
    • Try the key with the second closest distance.


• At every hop decrease the TTL till it reaches 0.
• To reduce the network load, the TTL can be dynamically decreased.
• Nodes can decide which request to process next based on the TTL.


Freenet = Gnutella + Pastry + anonymity


So the main idea is that to kind of take care of legal issues, the more that you can confuse the 
better it is. So the important point over here is why are we introducing this? So the reason we 
are introducing this is, of course, yes, BitTorrent and Freenet have been used to do a lot of bad 
in the world, but as computer science researchers, if let us say you want to stop it or regulate it 
or control it or investigate a crime that has been committed on these networks, you need to 


understand how it works such that you can do good to overwhelm the bad. 


So that is the main reason that this is being discussed. Now let us come back to our original
point. What we discussed is that any time a node finds the file, the search result goes back
along the original path. Each node on that path caches a copy of the file and creates an entry
in its routing table that points to the supplying node, which may not be the original source,
because it might have cached the file for some other query.


So if we cannot forward the request, because it is creating a cycle or because of a failure, we
declare failure for that attempt. But the protocol does not have to give up: it can try the key
with the second closest distance instead of the closest one, and keep trying. At every hop we
decrease the TTL, the time-to-live field, by one, and to reduce the network load the TTL can
also be dynamically decreased.
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If let us say you are finding that there are too many messages in the network, so let us say this 
is a network and there are too many messages, then you can dynamically reduce the TTL. Nodes
can also decide which request to process next based on the TTL. So, needless to say, we can
write a simple equation over here: Freenet is essentially equal to Gnutella, in the sense that
we are sending queries to neighboring nodes.


So recall that we had a lecture on Gnutella; if you are new to this lecture, watch the Gnutella
lecture first. To that we add Pastry, in the sense that we have the notion of keys and routing
tables. We also had a lecture on Pastry; look that up. Furthermore, there is a degree of
anonymity, in the sense that we try to obscure who the actual source is, and clearly the
identity of the original requester is not propagated.


So other than my immediate neighbor, nobody else knows that I actually requested the file.
This is why I had originally said that there has to be a degree of trust between a node and
its immediate neighbors: the immediate neighbors are aware that a node, even if it did not
request the file itself, at least forwarded a request for it.

Here is the other thing. Let us say this node is me and this node is my immediate neighbor.
If I forward a request, the immediate neighbor does not know for sure whether I requested it
or not. Maybe somebody else requested it and I am forwarding it on that somebody else's
behalf. So the immediate neighbor will not know, but at least it will know that I am
participating in the protocol. That degree of anonymity is there.
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Operation of the System 


(a) Routing Table 


key | address | content (?)


So the routing table is simple: it is basically the key, the address, and the content. We do
end up sending multiple messages, so let us look at this example. a sends a message to b, and
b sends it to c. c says, look, I do not have it, and the request then goes to e. e sends it to
d, and d sends it to b, but then we realize that there is a loop, so it comes back. Then d
says, look, I do not have it, and from e the request goes to f. Maybe f has it, so the reply
again walks back along the path. That is how we send multiple messages in this protocol.
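
The walk described above, where a request is forwarded towards the closest key, backs off when it hits a dead end or a loop, and then tries the next closest key, can be sketched roughly as follows. This is only an illustrative sketch: the `Node` class, the integer keys, the six-node topology in `build_demo`, and the shared `seen` set (standing in for each node remembering transaction IDs) are assumptions made for the example, not part of Freenet itself.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    store: dict = field(default_factory=dict)    # key -> file contents
    routing: dict = field(default_factory=dict)  # key -> name of a neighbor


def lookup(nodes, name, key, ttl, seen):
    """Depth-first search for `key`, trying the numerically closest key first."""
    node = nodes[name]
    if key in node.store:
        return node.store[key]            # found: the reply walks back
    if ttl == 0 or name in seen:
        return None                       # TTL expired, or the request looped
    seen.add(name)
    # Try routing entries in order of numerical closeness to the wanted key.
    for k in sorted(node.routing, key=lambda k: abs(k - key)):
        neighbor = node.routing[k]
        data = lookup(nodes, neighbor, key, ttl - 1, seen)
        if data is not None:
            node.store[key] = data        # cache a copy on the return path
            node.routing[key] = neighbor  # remember who supplied it
            return data
    return None                           # every path failed: report failure


def build_demo():
    """Hypothetical topology, loosely modeled on the a..f walk above."""
    return {
        "a": Node(routing={10: "b"}),
        "b": Node(routing={12: "c", 20: "e"}),
        "c": Node(),
        "d": Node(routing={11: "b"}),
        "e": Node(routing={18: "d", 30: "f"}),
        "f": Node(store={14: "file"}),
    }


nodes = build_demo()
print(lookup(nodes, "a", 14, ttl=10, seen=set()))
print(sorted(name for name, n in nodes.items() if 14 in n.store))
```

In this toy network the request visits c and d without success before reaching f, and afterwards the nodes on the successful path (a, b and e) each hold a cached copy and a routing entry for the key.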
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@ Gradually over time, information disseminates. 


@ Nodes start aggregating files with similar keys (numerically 
close ) 


@ Popular data gets replicated at a large number of nodes. 


® Routing tables gradually get bigger. 
© New entries get created all the time. 
@ Nodes can discover more of the network without disclosing 
their identity. 


So what happens is that gradually, over time, information disseminates: the information about
which node has which key spreads through the network. Nodes start aggregating files with
similar keys, because we route by numerical closeness. If you keep on running the protocol,
then at any single node, data with similar keys will start getting aggregated.


Popular data will start getting replicated at a large number of nodes in the network. Routing
tables will gradually get bigger, because new entries are getting created all the time, and
nodes are also discovering more of the network without actually revealing their identities.
So there was some degree of anonymity in Gnutella already, but there is a little bit of
additional anonymity here.


The primary reason is that for every real file we are creating multiple fake owners. Pretty
much every node on whichever path the file travels will cache a copy, and they will all claim
that they are owners. So given that in a large network there may be hundreds of nodes that
claim to be the owner, you do not know which one to catch.


And all of them are just participating in the protocol. This reminds me of an experience in
my high school where we all did some mischief. Well, actually only a few of us did the
mischief, but the entire class said that each one of them did it. So the teacher did not know
whom to punish, because it would not have been right to punish the entire class when 90% of
the class was innocent.

But he did not know which 10% was involved. This is something very similar: we create fake
owners, so nobody knows who is supplying, and also nobody knows who is requesting, because all
the nodes are participating in the protocol and you do not know whether a neighboring node is
actually requesting the file or forwarding somebody else's request.

So both the provider and the requester are, to a very large extent, anonymous.
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Outline 


@ Overview 


® Data Storage 


Storing Data 


@ User first creates a file key.

@ She then sends an insert message to her own node (file,
key, TTL)

@ If the node finds the key in its routing table, it returns the 
contents of the file. 

@ Otherwise, finds the closest key in its routing table and for- 
wards the insert message to it. 

@ If that insert causes a hash collision , the node passes data 
back to the upstream requester. 


@ Cache the file locally, and create a routing table entry (in 
response to insert). 


Now let us look at data storage. The user will first create a file key. She will then send an
insert message to her own node with the file, the key, and the TTL. If the node finds that the
key is already there in its routing table, it will return the contents of the file; otherwise
it will find the closest key in its routing table and forward the insert message to the
corresponding node. Here is the fun part: the node that is the original provider does not
actually store the file.

It uses the same routing mechanism to send it to a place that has the closest keys. Even at
the very beginning, if I want to store some piece of data in the network, I do not necessarily
have to keep a copy of it. I just send it into the network so that it is stored wherever the
closest key is, which means it ultimately gets stored somewhere else, far away from me.
If there is a hash collision, the data is passed back to the upstream requester, the file is
cached locally, and we create a routing table entry in response to the insert. On a hash
collision we do not forward the insert any further; the request sort of bounces, and whichever
node had forwarded it stores it. The import of this entire discussion should be rather clear
by now.

If this question has not naturally arisen in your minds, it now should. Let us say that there
is a file that will not get heavily shared. Then it would be possible to find out who was the
original inserter of the file into the network, if there is such a term. What this means is:
who brought this file into the network for the first time? To thwart that, whenever we insert
a file into the network, we do not keep it with the node that is actually inserting it.


We essentially let the file propagate through the network towards the node that has the
closest key, and it is stored pretty much at that node. So let us say there is a node and none
of its neighbors has a closer key; at that point it will store the file itself, otherwise it
will keep on forwarding. This ensures that the file propagates quite a bit and ends up far
away from the original provider, the original owner.

It gets stored over there, by the proximity of its keys, so that later on the legal liability
is pretty much gone: it is not possible to identify who had introduced this file into the
network.
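
The insert procedure described above can be sketched along the following lines. This is a simplified illustration under several assumptions: keys are plain integers, the `Node` class and the four-node network are hypothetical, the insert stops when the TTL reaches 0 or when a node has nowhere to forward, and on a collision the pre-existing data is what gets cached and handed back upstream.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    store: dict = field(default_factory=dict)    # key -> file contents
    routing: dict = field(default_factory=dict)  # key -> name of a neighbor


def insert(nodes, name, key, data, ttl, is_origin=True):
    """Return None if the insert succeeded, or the existing data on a collision."""
    node = nodes[name]
    if key in node.store:
        return node.store[key]            # collision: bounce existing data back
    existing = None
    if ttl > 0 and node.routing:
        # Forward towards the neighbor listed under the numerically closest key.
        closest = min(node.routing, key=lambda k: abs(k - key))
        neighbor = node.routing[closest]
        existing = insert(nodes, neighbor, key, data, ttl - 1, is_origin=False)
        node.routing[key] = neighbor      # routing entry created by the insert
    if not is_origin:
        # Nodes on the path cache the file; the inserting node keeps no copy.
        node.store[key] = data if existing is None else existing
    return existing


nodes = {
    "a": Node(routing={50: "b"}),
    "b": Node(routing={40: "c", 90: "x"}),
    "c": Node(),
    "x": Node(),
}
print(insert(nodes, "a", 42, b"song", ttl=5))
print(sorted(name for name, n in nodes.items() if 42 in n.store))
```

Here node a introduces the file but ends up with only a routing entry for it, while b and c hold the actual copies, which is the property the lecture relies on for deniability.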
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No Hash Collisions 
@ If the TTL field becomes 0, without any collisions , then it
means that the insert is successful.
@ Let the original requester know.


@ Each node on the path adds the file and key to its routing
table. It also caches the file.


@ The original node also adds the file and key to its store, and
routing table
@ How do you beat security?


@ Nodes along the way lie about
the data source


@ Add their own address or some
other node's address


So now assume that there are no hash collisions. If the TTL field becomes 0 without any
collisions, then the insert is successful. Let us say the TTL field was k and the message just
kept going on and on without any hash collisions; then we have a successful insert. So there
are essentially two ways of terminating the process of insertion. One is the approach that we
outlined on the previous slide:

we stop when we reach the node that has the closest key and no neighbor has a closer key; let
us call it the closest-key approach. This is a recent addition to Freenet-like protocols. The
other is that we just send the file k hops away in the direction of the closest keys, and when
the TTL field becomes 0, we store it. Regardless of which method is chosen, if the file is
ultimately inserted, the original requester is informed by back-propagating a message.


Each node on the path will add the file and the key to its routing table, and it will also
cache the file. The original node also adds the file and key to its store. It may do so, but
to beat security it may also choose not to. Now, how do you beat security? Nodes along the way
will lie about the data source. In any case they do not know the original data source, because
they are just forwarding the data.

So they will lie about it and say, I am the owner; everybody will say, I am the owner.
Furthermore, to add further anonymity, recall that with the key we have an address. So they
can put their own address there and pretend to be the owner. Or let us say they do not want to
hold the content; in many cases, for storage reasons, we may not want to keep the content of
the file.

We just want to have an address and redirect a request over there. In that case they can add
their neighbor's address, or the address of the node they are forwarding to. They can slightly
randomize that as well, which will introduce more entropy into the system and make it even
harder to find out who the actual owner is.
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Advantage of this Mechanism 


Advantages 
@ Newly inserted files are placed on nodes with similar keys. 


@ Information about newly inserted files can quickly be dis- 
seminated, 


@ An intruder will find it difficult to deliberately introduce hash 
Collisions. 


Advantages of this mechanism: newly inserted files are placed on nodes with similar keys,
which aids the search process. Information about newly inserted files can be disseminated
quickly. Furthermore, when a file is inserted into the network, it is actually added to a node
that is possibly far away from the original owner. An intruder will find it difficult to
deliberately introduce hash collisions. An intruder might want to do so, but as we have seen
there is a mechanism to detect a collision and backtrack.




(Refer Slide Time: 21:11) 


Data Management 


@ The routing table and data stores are finite structures. 
They follow a LRU (least recently used) replacement policy 
@ This ensures that old files get deleted from the system. 
@ Legal issues : 
@ A data store can deny the knowledge of the files it has
@ Take the hash key, and encrypt the contents of the files with
it
@ Any node can always decrypt a file, if it knows the key
@ However, it will not be able to find out what the original key
was.
@ Example: A data store can always say that it didn't know that
it had that many pop songs. It didn't know which file was a
pop song


So now let us come to the data management aspect. The routing table and data stores are finite
structures; they have a limited amount of storage. So they can be made to follow an LRU (least
recently used) replacement policy, where once a file has not been used for a long time, it is
deleted. Next, some legal issues. Up till now we have said that the owner of the file, who is
inserting the file into the system for the first time, and the consumer of the data can both
deny their role. But what about the nodes that are forwarding the information?
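
As a side note on the LRU replacement policy mentioned above, a finite data store of this kind can be sketched in a few lines. The class name `LRUStore` and its `get`/`put` methods are hypothetical names chosen for this example; the sketch only illustrates the policy and is not how Freenet itself is implemented.

```python
from collections import OrderedDict


class LRUStore:
    """A finite key -> file store that evicts the least recently used entry."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()        # oldest entry first, newest last

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)       # a hit makes the entry "recent" again
        return self.items[key]

    def put(self, key, data):
        self.items[key] = data
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # drop the least recently used file


store = LRUStore(capacity=2)
store.put(1, "file-1")
store.put(2, "file-2")
store.get(1)                 # key 1 is now more recent than key 2
store.put(3, "file-3")       # the store is full, so key 2 is evicted
print(list(store.items))
```

The same structure can be used for the routing table, so that entries for files nobody asks for gradually disappear from the system.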


Nodes that are forwarding all of this can deny knowledge of the files that they have by
saying, look, I was just forwarding, I did not take a look at it. That may cut ice with law
enforcement or it may not. The other idea is to take the hash key and encrypt the contents of
the files with it. Any node that knows the key can always decrypt the file. However, a node
that merely stores the file will not be able to find out what the original key was.

For example, a data store can always say that it did not know that it had many pop songs,
because it did not know which file was a pop song. Fair enough: if we are dealing with
encrypted data, then we can pass encrypted data all over the network, as long as the final
consumer knows what the key was. So what is the idea over here? While inserting the file, the
original owner encrypts the data.

It is assumed that all the consumers know what the key was, and only the encrypted versions
are transferred all over the network. So deniability does exist, in the sense that a node can
say, look, I did not know what I was forwarding, because it was encrypted and I had no way of
seeing what it was. The only catch is that the final consumer needs to know the key, and that
cannot happen via Freenet. That needs some other mechanism of disseminating the key.
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@ Details of the Protocol 
® Message Format 


Message Format 


Freenet Message 


@ Packet oriented, self-contained messages: TCP or UDP
@ Every transaction (search or insert) has a unique ID


@ All messages contain: 64-bit transaction ID, TTL counter,
and a depth field.
@ An attacker can guess the identities of nodes by scanning
the TTL value.
@ To thwart this: With a finite probability when the TTL field
reaches 1, keep propagating the request to other nodes
@ Have another field called depth that is incremented at each
hop. It starts with a (> 0) random value.
@ Before the destination sends the message back to the
source, set the TTL = depth. This ensures that the message
will not die before reaching the requester.


So this issue is there with encryption. It is a serious issue, but as I said, in many cases
just participating in Freenet can be a crime by itself. Participants may not be fully aware of
what exactly they are keeping and transferring, but participation alone can be deemed
criminal, depending on the jurisdiction and the laws of the region. Now, the message format.
We can have either TCP messages or UDP messages.


TCP is a reliable form of transmission; UDP is an unreliable form, where packets can be
dropped. Every transaction has a unique ID. So all messages will contain a 64-bit transaction
ID, a TTL (time-to-live) counter, and a depth field. What can an attacker do? The attacker,
who may be a member of a law enforcement agency, can guess the identities of nodes by
scanning the TTL value.

It will look at the TTL value and figure out where the original request came from. Let us say
the TTL value is always initialized to 10 and I see a TTL value of 5; then I can say that the
original source is probably 5 hops away. To thwart this, when the TTL field reaches 1, a node
will with a finite probability keep propagating the request to other nodes, or nodes will
dynamically add some random noise to the TTL field.


This will thwart such attacks probabilistically. Another approach is to have a new field
called depth, which we have not discussed up till now, that is incremented at each hop. It
starts with a random value. Before the destination sends the message back to the source, we
set the TTL equal to the depth, which ensures that the message will not die before reaching
the requester. Let me explain the context of the depth field once again.

The idea is that along with the TTL we have another field called depth, which is incremented
at each hop, but starts from a random value; that is very important. When we reach the
destination, the reply has to come back, and it will use the same TTL mechanism on the way
back. The point is that if the destination and the source are k hops apart and the reply's
TTL is less than k, the message will die before it reaches the source.


So let us say that the destination and the source are k hops apart, and from the destination,
which is the provider of the file, the message is coming back. On the way back too we use the
TTL mechanism, to ensure that all messages are ultimately removed from the system. Let us
assume that when the reply travels from the destination to the source we set TTL = k exactly.
Then if one of the nodes along this path belongs, let us say, to the police, it will have an
exact idea of how far the original requesting source is from it.

This is something that we want to avoid: even if there is an intruder in the network, it
should not be able to get enough information about the source. So what we do instead is use
the depth field, which has been initialized randomly. As we go from the source to the
destination, that is, from the requester to the provider, we keep incrementing it. So let us
say k = 10 and the depth was initialized to 5; it will become 15 at the destination.
So when the reply comes back, we actually send it with TTL = 15. This is guaranteed to be
greater than k, so the message will reach the source; it will not die or be removed from the
network in the middle. Furthermore, if there is an intruder, it will not get an exact idea of
how far the source is, because of the randomized initial value of depth. So this is an
important concept.

Kindly go over what I said and also go over the description in the paper. The key idea comes
back to how far apart the destination and source are. If they are k hops away, the TTL cannot
be equal to k, otherwise it will give an intruder an exact idea of how far the source is,
which reveals a lot of information. It cannot be less than k, because then the message will
die in the network. It has to be greater than k, and the amount by which it exceeds k has to
be a random value. That will confuse the intruder, which is exactly what we are trying to do
over here.
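
The arithmetic of the depth field can be checked with a small sketch. The function names below and the range of the random initial depth are assumptions made for this illustration; the only facts taken from the lecture are that depth starts at a random positive value, is incremented once per hop, and is copied into the TTL of the reply.

```python
import random


def depth_at_destination(hops, initial_depth):
    """Depth field after the request has travelled `hops` hops."""
    depth = initial_depth
    for _ in range(hops):
        depth += 1                      # incremented at each hop
    return depth


def reply_survives(ttl, hops):
    """True if a reply sent with this TTL can travel `hops` hops back."""
    for _ in range(hops):
        if ttl == 0:
            return False                # the message dies in the network
        ttl -= 1                        # decremented at each hop
    return True


k = 10                                  # requester and provider are k hops apart
depth = depth_at_destination(k, initial_depth=5)
print(depth)                            # reply TTL is set to this value
print(reply_survives(depth, k))         # TTL = depth > k
print(reply_survives(k - 1, k))         # TTL < k

# With a random positive initial depth, the reply TTL is k plus a random slack,
# so an observer on the return path cannot read off the distance to the source.
slack = random.randint(1, 8)            # assumed range, for illustration only
print(reply_survives(depth_at_destination(k, slack), k))
```

With k = 10 and an initial depth of 5, the reply leaves the provider with TTL = 15, which is the example used in the lecture; any positive initial depth gives a TTL strictly greater than k.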


Timers 


@ For every request, the requester starts a timer.
@ If a timer times out, then we infer a failure
@ Sometimes downstream nodes may send Reply.Restart messages
to the requester. The requester extends its timer.


Another mechanism is timers. For every request, the requester starts a timer. If the timer
times out, we infer a failure; this is basically an absolute time limit, so if we are not able
to get a reply within it, we infer a failure. Downstream nodes may send a Reply.Restart
message to the requester if there is a problem such as congestion, and then the requester will
extend its timer. So this is another mechanism for ensuring some amount of reliability and
robustness in the overall message transmission scheme.
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Successful Request 


@ If the request is successful, the remote node will send the 
Send.Data message. 


@ It can send the id of the data source (or possibly fake it). 

@ If TTL reaches 0, it sends a Reply.NotFound message.

@ If there are no more paths left and (TTL ≠ 0), then a
Request.Continue message is sent.

@ The requester can send the request to other nodes in its
routing table.


If the request is successful, the remote node will send the Send.Data message. It can send the
id of the data source or possibly fake it; it is useful to send extra information to confuse
an observer. If the TTL reaches 0, it will send a Reply.NotFound message. If there are no more
paths but the TTL is not 0, it will send a Request.Continue message. On receiving that, the
requester can send the request to other nodes in its routing table, which do not have the
closest key but, as we have seen, maybe the second closest key or the third closest key, and
so on.
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@ Details of the Protocol 


@ Naming and Searching 


Naming and Searching 


Naming, Searching, and Security 


Organizing Files 

@ Might be a good idea to have a directory of keys (files containing
similar data).

@ Read: legal issues

@ If there are no such issues, we can introduce directories ,
and bookmark lists

@ Search capabilities:

@ Should we have a search engine like Google. Goes against
our design goals: anonymity


@ Solution: use a lot of indirect files containing meta-data, all
over the network: list of keywords and keys of the files


@ To ensure that files have not been tampered, have a
content hash


So, naming and searching. For organizing files, it might be a good idea to have a directory of
keys, like files containing similar data. But the main idea is that we never wanted to
centralize. Recall that after Napster our main aim was not to centralize data, not to have a
central server, not to have a repository, and this was primarily to avoid the legal issues
that arise from doing so.

The reason is that you immediately get a list of servers, which will cause issues, and whoever
maintains this list will be in trouble. If there are no such issues, we can have directories
or bookmark lists. It is common that the dark web, which is built on Freenet-like
technologies, has such directories in countries that I will not name, which are beyond the
legal and judicial purview of many other countries, like ours, so it is not possible to trace
them.
So this is done, but of course you need that kind of stability from a country to do it; the
directory cannot be in the same legal jurisdiction, otherwise the person who is maintaining
it will be in trouble. Should we have a search engine like Google to do it? Well, that goes
against our design goal of anonymity. So the key solution is to use a lot of indirect files
containing metadata, spread all over the network.


So then we will have a dictionary with a list of keywords and the keys of the files. The key
idea is this. Let us say you are searching for my name. Actually my first name is not just
Smruti, it is Smruti Ranjan; that is the typical way people are named, with such ultra-long
names, in my home state.

But well, what to do, this is my big long name. So you can have one set of directories that
have a list of servers that have the first name, and another set of servers that have links to
nodes that have my last name. These can be separate queries, and you can then take the
intersection of the results to find servers that have links to my entire name. So different
combinations like this are also possible.
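
The intersection idea above can be illustrated with ordinary sets. The keyword index below, including the file keys in it, is invented purely for this example; in the scheme described in the lecture each keyword's list would live in its own indirect file somewhere in the network and would be fetched by a separate query.

```python
def search(index, keywords):
    """Return the file keys that are listed under every one of the keywords."""
    result = None
    for word in keywords:
        keys = index.get(word.lower(), set())   # one separate query per keyword
        result = set(keys) if result is None else result & keys
    return result or set()


# Hypothetical keyword -> file-key lists, as separate indirect files might hold.
index = {
    "smruti": {101, 205, 309},
    "ranjan": {205, 309, 412},
    "sarangi": {309, 518},
}

print(search(index, ["Smruti", "Ranjan"]))
print(search(index, ["Smruti", "Ranjan", "Sarangi"]))
```

No single directory needs to hold the full mapping: each keyword list can be stored and queried independently, and the requester does the intersection locally.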


So clearly there is a trade-off between visibility and efficiency. It is also possible that
within the network you have intruders who tamper with the contents of a file and then claim,
look, this is the file. To ensure this does not happen, we can hash the contents, and the hash
values can be kept with other nodes.
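
A tamper check of this kind might look like the following sketch. The choice of SHA-256 and the function names are assumptions made for the example; the lecture only says that a hash of the contents is kept with other nodes.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Hash of the file contents (SHA-256 is an assumed choice here)."""
    return hashlib.sha256(data).hexdigest()


def is_untampered(data: bytes, expected_hash: str) -> bool:
    """Compare a received file against the hash published elsewhere."""
    return content_hash(data) == expected_hash


original = b"lecture notes on Freenet"
published = content_hash(original)      # kept with other nodes in the network

print(is_untampered(original, published))
print(is_untampered(b"lecture notes on Freenet, modified", published))
```

A requester who obtains the hash from a different node than the one supplying the file can detect a modified copy, provided the two nodes are not colluding.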
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Security 


@ Sender anonymity is preserved because:
@ A node can never say if the node that is requesting a file is
the original requester, or is merely forwarding a request.
@ Messages between pairs of nodes can also be encrypted
(against eavesdroppers )


@ Pre-routing
@ The requester decides the routing path to a destination if it
has detailed routing tables.
@ Encrypt a message with a succession of public keys (for all
the nodes on the path)
@ A node will have no idea regarding the sender. It will only
know the id of the next hop.
@ No idea of the requested key
So, a little bit of security. Sender anonymity is preserved because a node can never say
whether the node that is requesting a file is the original requester or is merely forwarding
a request. We have discussed this. As for messages between pairs of nodes, the TTL is one
source of information, but we eliminated that source with the depth field and randomization.


There could be eavesdroppers, so we can encrypt messages, subject to the fact that the keys
are shared using some other mechanism. The requester may decide the routing path to a
destination, if it has detailed routing tables. It can furthermore encrypt a message with a
succession of public keys, one for each node on the path. So what is the idea? Of course, the
assumption here is that you know what public-private key encryption, for example RSA-based
encryption, is.

If you do not know what it is, then kindly take a look at it and restart the video from this
point. The idea is that every node is assumed to have a unique public key. So what we do is
keep encrypting a message with a succession of public keys along the path, and when it comes
back, we keep decrypting it with the private keys. So there are advantages of such mechanisms.


I am not going into the details. There are tons and tons and tons of mechanisms of adding more 
security to Freenet, so I will not discuss this in great detail. I will just discuss a few of the 
solutions that do exist. So in general a node does not have any idea regarding the sender, 
because it only has 1 hop visibility in this network and it also does not have a clear-cut idea of 


what is the key that is being requested. 


But the main issue with encrypting the keys in this fashion is that the protocol as we have 
explained, where you take a key and you do a lookup will not work, and you are also assuming, 
whenever there is some encryption that there is some other way in which the keys have been 
shared. So that is very important. So I am not discussing this in great detail because there is a 
lot of work already in this area and this is technically beyond the scope of this current 


discussion. 
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® Setup 


Evaluation Setup 


@ Network has between 500 to 900 nodes. 
@ 40 items in each node. 

© Routing table size: 50 addresses 

@ Network topology: Chain 


So, a little bit about the evaluation results. In the paper they look at a network with
between 500 and 900 nodes, with 40 items in each node. The size of the routing table is
limited to 50 addresses, and the topology is a linear, chain-based network. This is not the
best network to test on, but nevertheless that is what is used.
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® Results 


Results 
Evaluation 


Successful Requests(%) 


vs #Queries 


@ The number of queries were varied from 50 to 1200 


@ For 500 nodes, the percentage of successful requests rose 
quickly from 20% (at 50 queries) to 100% (at 300 queries). 

@ With 600 to 900 nodes, the percentages started at roughly 
10%. 

@ They reached close to 100% for more than 400-500 queries. 


@ More are the nodes, lesser is the percentage of successful 
requests. 


The results that they show are as follows. The number of queries was varied from 50 to 1200.
For 500 nodes, the percentage of successful requests rose quickly from 20% to 100%. That is
because as there are more and more messages in the network, information disseminates, and it
disseminates rather quickly. With 600 to 900 nodes, things started slower, at roughly 10
percent, but after 400 to 500 queries they reach close to 100 percent.

So in pretty much all such networks information does get disseminated quite rapidly, and once
things stabilize most of the searches are successful. The more nodes there are, the lower the
percentage of successful requests and the longer the warm-up time, but the ultimate results
are good and acceptable.
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#Hops vs #Queries 


@ Experiments were conducted with 500, 600, 700, 800 and 
900 nodes. 

@ The #hops reduced quadratically from roughly 50 (20 queries) 
to 10 (> 600 queries). 

@ More are the nodes, more are the hops. 




Number of hops versus queries. Experiments were conducted with 500 to 900 nodes. The number
of hops reduced quadratically from roughly 50 to 10. To be clear, the claim is that more
queries imply fewer hops, not that more nodes imply fewer hops.

As we increase the number of queries, let us say from 20 queries to 600 queries, information
gets disseminated all over the network, so you see a quadratic reduction in the number of
hops, which is what you would expect. You would also expect that as the number of nodes
increases, the average number of hops increases.

So these are not earth-shattering results. You would expect a reduction; in this case it is a
super-linear reduction in the number of hops traversed versus the number of queries. This was,
by the way, the last slide.
Clarke, Ian, et al. "Freenet: A distributed anonymous information storage and retrieval
system." Designing Privacy Enhancing Technologies. Springer Berlin Heidelberg, 2001.


This is the paper that describes the original Freenet system, which led to a lot of work in
this area. The key points that actually anonymize Freenet are as follows. The first is one-hop
visibility; let us call it request denial, which basically means that you never reveal whether
you are the original requester or are just forwarding somebody else's request. The other is
sender denial, where we have the notion of fake owners:

a node will say, look, I am the owner, even though it may not be. Furthermore, if a node is
not actually storing the file, its routing table entry will point to one of these fake owners.
Then, at the time of insertion, the node that is inserting the file will not keep a copy, but
will actually send it far away in the network, to a remote node with possibly the closest key.

That node will cache a copy of the file, as will some of the nodes in its neighborhood via
which the message reached it. The last point is that the TTL field can reveal some amount of
information, so we also incorporate a depth field, which adds a degree of randomization. These
are pretty much all the steps that give Freenet a degree of anonymity that is not there in
other systems such as Pastry and Gnutella.


And of course, we do not have any centralized server or any kind of centralized directory that holds a big list of keys and servers. So this is where Freenet ends. Freenet has, of course, matured significantly; there is a huge darknet, you have the Tor browser and so on. You can take a look at the security issues involved in such technologies and at how law enforcement agencies try to keep tabs on these networks and find offenders. Thank you very much.
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Smruti R. Sarangi Chord 

So, we will discuss the second distributed hash table algorithm, called Chord. A prerequisite of this lecture is the Pastry DHT, the Pastry distributed hash table, which was covered in a previous lecture. So, now we will discuss Chord.

Outline

• Overview
• Design of Chord
    • Basic Structure
    • Algorithm to find the Successor
    • Node Arrival and Stabilization
• Results


So, the lecture will go like this: overview, design of Chord (the basic structure, the algorithm to find the successor, node arrival and stabilization), and finally the results.

Overview

Comparison with Pastry

Chord vs Pastry

• Each node and each key’s id is hashed to a unique value.
• The process of lookup tries to find the immediate successor to a key’s id.
• The routing table at each node contains O(log(n)) entries.
• Inserting and deleting nodes requires $O(\log^2(n))$ messages.
• Sarangi View: More robust than Pastry, and more elegant.


So, Chord versus Pastry. The key idea is the same: each node’s id and each key’s id is hashed to a unique value. But the process of lookup is different. Recall that in Pastry, the lookup tried to find the node that was closest to the key. In this case we do not do that; instead, on the circle, we try to find the node that is the immediate successor of the key.

So, we take the key’s id and find the node that is its immediate successor when traversing the circle clockwise. The routing table at each node contains O(log(n)) entries on average, and inserting and deleting nodes requires $O(\log^2(n))$ messages. It is just my view that this is more robust than Pastry and far more elegant as a solution. But that is just my view, and you need not subscribe to it. You can form your own view by the time we reach the end of this lecture.

Overview

Comparison with other Systems

• The Globe system assigns objects to locations, and is hierarchical. Chord is completely distributed and decentralized.

CAN

• Uses a d-dimensional co-ordinate space.
• Each node maintains O(d) state, and the lookup cost is $O(dN^{1/d})$.
• Maintains a lesser amount of state than Chord, but has a higher lookup cost.

So, a comparison with a few of the other systems. By the time Chord arrived, there were many more DHTs in the research market. There was the Globe system, which used to assign objects to locations; it was hierarchical. In comparison, Chord is completely distributed and decentralized. And of course, Chord, similar to Pastry, uses a ring-based overlay.

If you do not want to use a ring, you can use CAN. CAN uses a d-dimensional coordinate space. So, instead of a circle, it uses a d-dimensional coordinate space, which is basically a hypercube: a square, a cube, and then a hypercube. If you do not know what a hypercube is, this is the right time to look it up on Wikipedia.

Each node maintains O(d) state, which is basically its d neighbors in the d-dimensional coordinate space, and the lookup cost is $O(dN^{1/d})$. So, it does maintain a lesser amount of state than Chord, but it has a higher lookup cost.

So, CAN is one of the immediate competitors of Chord. The interesting part is that, as opposed to the circular overlay that both Pastry and Chord use, CAN has this d-dimensional hyperspace. It is a worthy competitor, but we will still find the elegance of Chord to be something that will really endear it to us.

• Automatic load balancing
• Fully distributed
• Scalable in terms of state per node, bandwidth, and lookup time.
• Always available
• Provably correct.

So, what are the features of Chord? Well, you have automatic load balancing. It is fully distributed. It is scalable in terms of state per node, bandwidth, and lookup time. Furthermore, it is always available, in the sense that there is a built-in amount of redundancy in the Chord protocol.

That is the reason availability is high. Second, it is provably correct, but the same was true of Pastry as well: these are simple protocols, and it is easy to prove that they actually work. I mean, all corner cases are taken care of, and they work in a robust, provable fashion.

Design of Chord: we will take a look at the basic structure.

Consistent Hashing

Definition

Consistent Hashing: It is a hashing technique that adapts very well to resizing of the hash table. Typically $k/n$ elements need to be reshuffled across buckets. $k$ is the number of keys and $n$ is the number of slots in a hash table.

So, the key idea of Chord is called consistent hashing. It is a hashing technique that adapts very well to resizing the hash table. So, what is the key idea? It is that typically $k/n$ elements need to be reshuffled across buckets. So, what happens is this.

When you have a large space, and you have these nodes that are also called buckets, and you add a node in the middle, then you need to move some keys from here to here. In Pastry, it was a two-way move, but in Chord, it is a one-way move. This is one more advantage. It is because we store each key at its immediate successor, the immediate clockwise successor that is.

The same holds for a delete as for an add: if later on we delete this node, then some keys have to be moved back from here to here. So, on average, the movement is limited to $k/n$ elements across the buckets, where $k$ is the number of keys and $n$ is the number of slots in a hash table.

The number of slots in a hash table basically refers to the number of nodes in this case. So, $k/n$ is the typical movement, which is anyway on expected lines; it is not unexpected. What we will see next is how consistent hashing plays a role in Chord. Here, of course, in this figure, we are using the term bucket, so we will get to it in some time.

But essentially, what this figure is showing is that on a big circular ring, different keys are hashed to different positions, and then we group them into buckets. This basically means that the set of keys in one bucket is assigned to the same physical node. I am not showing the node over here, but if there is a node here, then this entire bucket gets assigned to it, because we assign keys to the immediate clockwise successor.
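
The one-way move can be illustrated with a minimal Python sketch. The node and key ids below are made up for illustration, and the helper names `successor` and `assign` are assumptions of this sketch, not names from the lecture. A key whose id equals a node id is assigned to that node here.

```python
from bisect import bisect_left

def successor(sorted_node_ids, key_id):
    """Return the first node id at or after key_id, wrapping around the ring."""
    idx = bisect_left(sorted_node_ids, key_id)
    return sorted_node_ids[idx % len(sorted_node_ids)]

def assign(node_ids, key_ids):
    """Map every key id to its clockwise successor node."""
    nodes = sorted(node_ids)
    return {k: successor(nodes, k) for k in key_ids}

# Hypothetical toy ring with ids in [0, 64).
nodes = [8, 22, 40, 57]
keys = [3, 10, 14, 19, 25, 33, 41, 50, 60]

before = assign(nodes, keys)
after = assign(nodes + [15], keys)      # a new node joins at position 15

# Only keys that used to sit on the new node's successor can move.
moved = {k: (before[k], after[k]) for k in keys if before[k] != after[k]}
print(moved)
```

In this toy ring only keys 10 and 14 change owner, and both move from node 22 to the new node 15; every other bucket is untouched.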

Structure of Chord

• Each node and key is assigned an m-bit identifier.
• The hash for the node and key is generated by using the SHA-1 algorithm.
• The nodes are arranged in a circle (recall Pastry).
• Each key is assigned to the smallest node id that is larger than it. This node is known as the successor.

• For a given key, efficiently locate its successor.
• Efficiently manage addition and deletion of nodes.

So, the structure of Chord is slightly different from Pastry. Each node and each key is assigned an m-bit identifier; we will discuss the value of m later. The hash for the node and the key is generated using the SHA-1 algorithm. It is a hashing algorithm that takes the value of a node, which is essentially the IP address of the node.

So, given the node, we take its IP address and pass it through the hashing algorithm, and this gives us the hash. Similarly, we take the key, pass it through the hashing algorithm, and that gives us the hash. The nodes are arranged in a circle, similar to Pastry. The major difference with respect to Pastry is that each key is assigned to the smallest node id that is larger than it, or call it the clockwise successor.

So, that is the most important thing: each key is assigned to the node that is just larger than it, basically its clockwise successor. The objectives are, for a given key, to efficiently locate its successor on this ring, and to efficiently manage the addition and deletion of nodes. So, those are the two aims: if you want flexibility, can we provide it?

And in this case, flexibility basically means that we can add as many nodes as we want and remove as many nodes as we want. So, what is an advantage? I will just quickly show it over here.

So, one advantage of this scheme is the following. Let us say we have the circular ring with some nodes on it, and it is festival time, say Christmas in the US or Diwali in India, which is when most of the shopping happens. And say your favorite shopping site uses a DHT. Then what you can do is add more servers.

The new servers will be positioned along this ring, and they can take away a large part of the load of the existing nodes. So, the network will rebalance. And as we have seen, roughly $k/n$ keys need to be transferred for every new server that is added. So, this gives us a method to flexibly expand and grow the network.

So, when there is a festival, you expect a lot of people to be searching for things and buying and selling things. All of that can be handled by adding extra servers to the network to share the load. And when the festival is over, many of these newly added nodes can be deleted. This is what gives DHTs their flexibility.
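
The festival scenario can be tried out with a small Python sketch. Everything specific here is an assumption made for illustration: the IP addresses, the key names, the choice of m = 32 bits, and reducing the SHA-1 digest modulo 2^m to obtain an m-bit id.

```python
import hashlib
from bisect import bisect_left

M = 32  # assumed number of identifier bits for this sketch

def chord_id(name, m=M):
    """Hash a string (an IP address or a key) to an m-bit identifier using SHA-1."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (2 ** m)

def owner(sorted_node_ids, key_id):
    """Clockwise successor of key_id on the ring."""
    idx = bisect_left(sorted_node_ids, key_id)
    return sorted_node_ids[idx % len(sorted_node_ids)]

# Hypothetical servers and keys.
servers = [f"10.0.0.{i}" for i in range(1, 21)]       # n = 20 nodes
keys = [f"item-{i}" for i in range(10_000)]           # k = 10,000 keys

ring = sorted(chord_id(s) for s in servers)
key_ids = [chord_id(k) for k in keys]
before = [owner(ring, k) for k in key_ids]

new_id = chord_id("10.0.0.21")                        # one extra server for the festival
ring_after = sorted(ring + [new_id])
after = [owner(ring_after, k) for k in key_ids]

moved = sum(1 for b, a in zip(before, after) if b != a)
print(f"moved {moved} of {len(keys)} keys; k/n = {len(keys) // len(servers)}")
```

Every key that changes owner ends up on the new server; the other servers only ever give keys away. The printed count can be compared against k/n, keeping in mind that a single join on a small ring can deviate noticeably from the average.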

So, now let us see the properties of Chord's hashing algorithm, because the uniformity of the hashing algorithm plays a rather important role in the design and operation of Chord. For n nodes and k keys, as we have said, with high probability each node stores at most $(1 + \epsilon)k/n$ keys.

Addition and deletion of a node leads to the reshuffling of $O(k/n)$ keys. This is a property of the hashing algorithm of Chord, which is basically SHA-1. Previous papers have proven that $\epsilon$ is limited to $O(\log(n))$. And if we really care about the uniformity of distributing keys to nodes, there are techniques to reduce $\epsilon$ using what are called virtual nodes.

What we can do is make each physical node contain log n virtual nodes. A virtual node, by the way, is a full-fledged node in so far as the hashing is concerned. So, let us say that we have a consistent hashing ring like this, and we have a physical node p.

If we create log n virtual nodes out of it, then each virtual node's id will basically be the hash of the IP address of p combined with the virtual node number. So, one virtual node will hash to one point on the ring, another virtual node will hash to another point on the ring, and this one will hash to this point on the ring.

Similarly, if there is another physical node with, let us say, three virtual nodes, one virtual node might hash over here and another might hash over there. What this does is further increase the spread and the randomness. But it is not a one-way street: it also increases the number of nodes.

So, this will increase the load on the network, and in a sense it just creates a larger network. But it will also increase the amount of randomness and the spread. The benefit is that the number of keys actually stored on each physical node is expected to be substantially more uniform and homogeneous across nodes, which is what we wanted.
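
The effect of virtual nodes can be explored with a short Python sketch. The IP addresses, key names, the 32-bit id size, and the `"<ip>#<number>"` naming of virtual nodes are all assumptions of this sketch; the lecture only says that the id is derived from the IP address together with the virtual node number.

```python
import hashlib
import math
from bisect import bisect_left
from collections import Counter

def h(name, m=32):
    """SHA-1 based m-bit identifier (m = 32 is an assumption of this sketch)."""
    return int.from_bytes(hashlib.sha1(name.encode()).digest(), "big") % (2 ** m)

def load_per_physical_node(physical, keys, vnodes_per_node):
    """Count the keys stored on each physical node when each one runs several virtual nodes."""
    # Each virtual node id is the hash of "<ip>#<virtual node number>".
    ring = sorted((h(f"{p}#{v}"), p) for p in physical for v in range(vnodes_per_node))
    ids = [vid for vid, _ in ring]
    load = Counter({p: 0 for p in physical})
    for key in keys:
        idx = bisect_left(ids, h(key)) % len(ring)    # clockwise successor
        load[ring[idx][1]] += 1
    return load

physical = [f"10.0.0.{i}" for i in range(1, 33)]       # n = 32 hypothetical nodes
keys = [f"key-{i}" for i in range(20_000)]             # k = 20,000 hypothetical keys
v = max(1, round(math.log2(len(physical))))            # log n virtual nodes each

for count in (1, v):
    load = load_per_physical_node(physical, keys, count)
    print(count, "virtual node(s): max load", max(load.values()),
          "min load", min(load.values()), "k/n =", len(keys) // len(physical))
```

Comparing the maximum and minimum loads for the two settings gives a feel for how much the spread tightens around k/n, at the price of a ring with log n times as many entries.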


Of course, there are scalability issues, because we are increasing the number of nodes substantially, by log n times, so that is an issue. But if we want uniformity, this is a good strategy because, as you can see from the figure, if one physical node has three virtual nodes, they can hash all over the place. If another has three virtual nodes, they too can hash all over the place.

So, given the law of large numbers, we can say that with very high likelihood the number of keys with each physical node will be $k/n$, or very close to $k/n$, if we follow this virtual node-based scheme, where $k$ is the total number of keys and $n$ is the number of physical nodes. And the reason should be clear from this diagram.

Basic Operation

• Let m be the number of bits in an id.
• Node n contains m entries in its finger table.


So, let m be the number of bits in an id. This applies to both the node id and the key id, which are produced by the SHA-1 process. As far as the lookup is concerned, it does not matter whether a node is a physical node or a virtual node. It is only when we look at the uniform distribution of keys to nodes that it actually matters. So, we will now define two terms: the successor and the predecessor.

The successor is the next node on the identifier circle. The predecessor is the previous node on the identifier circle. Now here is the fun part: we are going to define the notion of fingers. So, what actually is a chord?

(Refer Slide Time: 15:12)

A chord is basically what we get when we take a circle and cut it like this. So, this is a chord. So, now it is very clear why we call this system Chord. You take any point; whether it is a node or a key is immaterial. Maybe let us start with a key. Every key will have a position on the ring. The biggest chord, of course, will be the diameter.

And then we can start defining smaller chords, where we kind of halve the angle every time, something like this. Looking at them the other way round, these start as small portions of the circle and gradually get bigger and bigger. This is the way that we divide the entire space of keys: by drawing a set of straight lines, and as you can see, they are chords. And the final one is, of course, half the circle, the semicircle. So, how do we do it?

Basic Operation

Well, let us first start by defining it mathematically. We call each of these a finger. So, take these individual chords; the region that a chord marks off, something like this, is what we define as a finger. Well, it looks more like a pizza slice, but let us call this a finger. It also looks like a fingernail if you look at it from a certain angle, but I would have called it a pizza slice.

So, every finger has a start and an end. The start of the i’th finger is at position $(n + 2^{i-1}) \bmod 2^m$. Why mod? Because you want to wrap around the circle and come to the other side. And the finger ends at $(n + 2^i - 1) \bmod 2^m$.

So, essentially you are increasing the size of the finger by a power of 2 each time, which is something that you saw. Let me show it to you in a table. That will give you an idea of what exactly I am referring to. Let the id over here be n. So, n is the position.

I am slightly changing the definition of n here. Previously n referred to the total number of nodes, but now let it be the position on the ring. Consider $i = 1$. The start will be $n + 2^{i-1}$, which is $n + 2^0 = n + 1$, and the end will be $n + 2^i - 1$, which is $n + 2^1 - 1 = n + 1$. If $i = 2$, the finger goes from $n + 2$ to $n + 3$.

If $i = 3$, it goes from $n + 4$ to $n + 7$. Similarly, for $i = 4$, it goes from $n + 8$ to $n + 15$. So, as you can see, we are doubling the number of ids in each finger. They keep getting bigger and bigger, until the last one encompasses half the circle. So, as you can see, this diagram is embodied in this math. For a general $i$, the finger goes from $n + 2^{i-1}$ to $n + 2^i - 1$.

So, the first finger is small, the next slightly bigger, and so on, bigger and bigger until it reaches the semicircle. So, the first question is, why do we define fingers like this? Well, we will see that there is a massive advantage, and we will gradually appreciate it. But I hope that the idea of these fingers is clear.
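
The start and end of each finger can be tabulated with a few lines of Python. The function name `finger_intervals` and the choice of n = 3 with m = 6 bits are assumptions made for this illustration.

```python
def finger_intervals(n, m):
    """Return (i, start, end) for the m fingers of position n on a ring of size 2**m."""
    size = 2 ** m
    table = []
    for i in range(1, m + 1):
        start = (n + 2 ** (i - 1)) % size      # start of the i'th finger
        end = (n + 2 ** i - 1) % size          # last id covered by the i'th finger
        table.append((i, start, end))
    return table

for i, start, end in finger_intervals(n=3, m=6):
    print(f"finger {i}: {start} to {end}")
```

For n = 3 and m = 6 the last finger runs from 35 and wraps around to 2, which is exactly half of the 64 ids on the ring.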


The way that we define these fingers, each one of them essentially draws a line. That is also why the scheme is called Chord: any time two points on a circle are joined by a line, it becomes a chord. Now, finger[i].node. For every finger, we have a start and an end, which are just two positions; finger[i].node is the id of the node that is the successor, the clockwise successor, of finger[i].start.

So, what is the basic operation? The basic operation is that for a given key, you find the node id that is its successor, which means that from the key, we start walking the circle clockwise until we find the nearest node, and that nearest node is the successor.


(Refer Slide Time: 21:23) 

Outline

• Design of Chord
    • Algorithm to find the Successor

So, what is the algorithm to find the successor? This is by far the most important algorithm in Chord, so we should take a look at it pretty seriously.

So, just to come back, if I were to look at a circle and give actual values, then what we would see is this. For node 3, the first three fingers start at (3 + 1), (3 + 2) and (3 + 4), and the successor, that is the finger node, of each of them is node 8. For (3 + 8), which is the start of the next finger, there is already a node over there, so its successor is 11. And if, let us say, a key arrives whose id is 14, its successor is 15.

And (3 + 16) is the start of the next finger; its successor is node 22. For (3 + 32), the successor is node 40. Of course, this diagram has not been drawn to scale; it is meant to show you certain things. So, do not expect a semicircle where you do not find one. It is just an illustrative example.
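
The finger nodes in this example can be recomputed with a short Python sketch. The node set {3, 8, 11, 15, 22, 40} is read off the values mentioned above, and m = 6 is an assumption (the largest finger mentioned starts at 3 + 32); the function names are likewise just for this sketch.

```python
from bisect import bisect_left

M = 6                                  # assumed: 6-bit ids, ring positions 0..63
NODES = [3, 8, 11, 15, 22, 40]         # node ids taken from the example values

def successor(ident):
    """First node at or after ident when walking the ring clockwise."""
    idx = bisect_left(NODES, ident % 2 ** M)
    return NODES[idx % len(NODES)]

def finger_table(n):
    """List of (i, finger[i].start, finger[i].node) for node n."""
    rows = []
    for i in range(1, M + 1):
        start = (n + 2 ** (i - 1)) % 2 ** M
        rows.append((i, start, successor(start)))
    return rows

for row in finger_table(3):
    print(row)
```

The rows for node 3 come out as starts 4, 5, 7, 11, 19, 35 with finger nodes 8, 8, 8, 11, 22, 40, matching the walk-through.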

Let us now discuss the algorithm to find the successor. The key idea is that if I have the ring, and I have an id over here, I want to find the successor node of this id, the node to which this id is mapped. This can be the id of a key or the id of a node; at this point it does not matter which it actually is.

What we do is contact one of the nodes that we already know, say node n, which is a part of the ring, and ask it to find the successor of id. So, we call the function n.findSuccessor(id). It turns out that to find the successor, it is actually easier to find the predecessor of id first. Since every node has a pointer to its successor, we arrive at the predecessor first, and from there we arrive at its successor, and this node will also be the successor of id.

So, this is exactly what we do: we set node n’ to the predecessor of id, and from n’ we find the successor of n’, which is what we need to return. To find the predecessor, we follow a recursive procedure. I will show this in a slightly bigger diagram.

So, consider this to be the ring, this to be the starting position n, and this to be id. What we do is find the finger of n that I am calling the closest preceding finger, based on its finger node. Recall that finger[i].node = successor(finger[i].start). So, finger[i].node is the successor of finger[i].start.

So, we find that finger of n whose finger node is the closest preceding one, in the sense that if I start at id and walk anticlockwise, I may encounter many finger nodes of node n, but I want the closest one, the first one that I encounter while walking anticlockwise. So, it is possible that there is a finger of this type.

This is the start of the finger, this is the end of the finger, and maybe the node of the finger is over here. Let us call it the finger node. So, from n I can immediately reach this node, which is the best information that I have about id.

So, I can look through all my finger nodes and find the closest preceding one, which means that out of all my finger nodes, this is the one that is closest to id and precedes it. This I can do very easily; node n just needs to search within its finger table. Let us now blow this up with a magnifying glass. So, let us go to the next slide.

So, I blow it up and look at it. It is possible that id is over here, and the finger node of the closest preceding finger, let me call it node n’, is here. Again, n’ will have a bunch of fingers, and one among those will be its closest preceding finger for id. So, n’ will do the same thing: it will again find this point, which is the finger node of its closest preceding finger.

Again, we can blow this up and look at it. Again we have id over here, and the point n’ over here; we look at its fingers and come to a node that is even closer. This is the closest preceding finger of id as per the information at n’.

So, if that was n’, let us call this one n’’. Again, I can start from this point, reach n’’’, and keep on narrowing the range until I reach a point where id lies between the node I am considering and the successor of that node. Until id is between these two points, I keep going.

Given the fact that I am continuously reducing the distance to id, I am ultimately guaranteed to find such a point. This tells me that I have found my position, so at this point I stop.

So, what we see in the code is exactly this. We initialize n’ to n. As long as our condition is not met, the one that I showed just now, that is, as long as id is not between n’ and the successor of n’, I keep finding the closest preceding finger, setting n’ to it, and repeating.

So, I just keep getting closer and closer. Ultimately, this condition will hold, and that is when I know that I am done. I return the value of n’, which is the predecessor, and the predecessor's successor is what I am after. So, now what I need to do is figure out how the closest preceding finger function is written.

n.closestPrecedingFinger(id) begin
    for i ← m downto 1 do
        if finger[i].node ∈ (n, id) then return finger[i].node
    return n
end

So, the way that this is written is again rather interesting, and again I need to draw a big diagram to explain it. Consider this as the relevant part of the ring. Of course, this is not drawn to scale. This point is node n, and this point is id. What I do is start from the biggest finger and work backwards to the smallest finger. So, this could be the biggest finger, and from here I start walking backwards.

Let us say this is the m’th finger; then this is the starting point of that finger. Its finger node will be after that, beyond id, so it is not what I am after. Then let us say that this is the (m - 1)’th finger, and this is its starting point. Again, I check whether the node of this finger is before id. If it is, I return it; otherwise, I look at the starting point of the (m - 2)’th finger and at its finger node.

Say that finger node is between n and id; then I return it. That is the way I proceed. The reason I proceed this way, anticlockwise and not clockwise, is basically that I can eliminate most of the circle by considering the largest finger, then eliminate most of what remains by considering the second largest, and so on and so forth.

Furthermore, what is a match? A match for the i’th finger means that the node of the i’th finger is in between n and id. The first time I find such a match, the first time I find that this node is in between n and id, I can simply return it as the result of my function.

So, this is what I am doing in the code: I start from the largest finger and count down to the smallest. The first time that my condition holds, which means that finger[i].node, the node of finger i, is in between n and id, I return finger[i].node and I am done. There is a possibility that I will not find a finger[i].node that satisfies this condition.

When would that happen? Let us say n and id are close by, so that for none of the fingers does the successor of the start lie within this range. Then it is easy: it means that the current node n is itself the closest preceding node. That is the trivial case, and n can simply be returned. This is a very important operation. I would request you to kindly go over this part of the video several times until you understand all of it, particularly the special case in which all the finger nodes are after id.

So, recall that every finger, and this is a very important concept, do not forget it, has a start and an end. That is the range of the finger. But the most important thing that we are concerned about is the successor of start, which is the finger node. And if you are unlucky, the finger node can be after the end of the finger, because there might not be any nodes within this range. This is an important thing that we need to bear in mind.
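
Putting the pieces together, the lookup can be sketched in Python as follows. This is a single-process illustration under stated assumptions, not the lecture's own code: the node set and m = 6 come from the running example, finger tables are computed from a global list of nodes rather than learned over the network, and the names (`Node`, `TABLE`, `in_open`, `in_half_open`) are invented for this sketch.

```python
from bisect import bisect_left

M = 6
RING = 2 ** M
NODES = [3, 8, 11, 15, 22, 40]         # example node ids (assumed, from the running example)

def successor_of(ident):
    idx = bisect_left(NODES, ident % RING)
    return NODES[idx % len(NODES)]

def in_open(x, a, b):
    """True if x lies strictly between a and b when walking clockwise from a."""
    if a == b:
        return x != a                  # the interval covers the whole ring except a
    return 0 < (x - a) % RING < (b - a) % RING

def in_half_open(x, a, b):
    """True if x lies in (a, b] when walking clockwise from a."""
    return x == b or in_open(x, a, b)

class Node:
    def __init__(self, ident):
        self.id = ident
        # finger[i].node = successor(finger[i].start), for i = 1..M
        self.fingers = [successor_of(ident + 2 ** (i - 1)) for i in range(1, M + 1)]
        self.successor = self.fingers[0]

    def closest_preceding_finger(self, ident):
        for node in reversed(self.fingers):       # largest finger first
            if in_open(node, self.id, ident):
                return node
        return self.id                            # no finger node lies before ident

TABLE = {n: Node(n) for n in NODES}

def find_predecessor(start, ident):
    """Return (predecessor of ident, list of nodes visited), starting at node `start`."""
    current, path = TABLE[start], [start]
    while not in_half_open(ident, current.id, current.successor):
        current = TABLE[current.closest_preceding_finger(ident)]
        path.append(current.id)
    return current.id, path

def find_successor(start, ident):
    pred, _ = find_predecessor(start, ident)
    return TABLE[pred].successor

print(find_predecessor(3, 17), find_successor(3, 17))
```

Starting at node 3 and looking for id 17, the sketch visits nodes 3, 11 and 15, stops at 15 because 17 lies between 15 and its successor, and reports node 22 as the successor of 17.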

Now, here is the fun part: what is the complexity? When we talk about the complexity of this algorithm, we really do not care about the number of operations we perform within each node, because that is assumed to be extremely fast in any distributed algorithm. The only thing that we care about is how many messages are sent sequentially.

So, we do not care about what a node is doing; that is assumed to be infinitely fast. If that is the case, we need to understand how many messages are actually sent. Let us look at this concrete example. Consider this set of nodes, and say we are interested in id number 17. This can be the id of a key or the id of a node; it does not matter.

But here is the important part: 17 lies in the i’th finger of node 3. Why do I say that? Because $3 + 2^{i-1}$ is 11, and there is a node over there, so this is its finger node. And what is the value of i in this case? Well, it is $i = 4$. So, $3 + 2^i = 3 + 16 = 19$.

And clearly the finger node of the next finger, the (i + 1)’th finger, is 22. So, as far as node 3 is concerned, the closest preceding finger of 17 is node number 11. This is how we are going to proceed. This is an example. Given that, can we derive some sort of routing complexity?
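
Before deriving the routing complexity, one can get an empirical feel for it by counting hops in a small simulation. This Python sketch is only an illustration: the ring size (m = 16), the node counts, the random seed, and all function names are assumptions, and finger tables are again computed centrally from the full node list.

```python
import random
from bisect import bisect_left

def build_fingers(nodes, m):
    """Finger table of every node: finger[i].node = successor(n + 2**(i-1))."""
    ring = 2 ** m
    def succ(x):
        return nodes[bisect_left(nodes, x % ring) % len(nodes)]
    return {n: [succ(n + 2 ** (i - 1)) for i in range(1, m + 1)] for n in nodes}

def between(x, a, b, ring):
    """x in the open interval (a, b), walking clockwise from a."""
    return 0 < (x - a) % ring < ((b - a) % ring or ring)

def lookup_hops(fingers, start, ident, m):
    """Number of node-to-node messages needed to reach the predecessor of ident."""
    ring, current, hops = 2 ** m, start, 0
    while True:
        nxt = fingers[current][0]                        # current node's successor
        if ident == nxt or between(ident, current, nxt, ring):
            return hops
        # closest preceding finger: scan from the largest finger downwards
        current = next((f for f in reversed(fingers[current])
                        if between(f, current, ident, ring)), current)
        hops += 1

random.seed(1)                                           # fixed seed so runs are repeatable
M = 16
for n_nodes in (16, 64, 256, 1024):
    nodes = sorted(random.sample(range(2 ** M), n_nodes))
    fingers = build_fingers(nodes, M)
    trials = [lookup_hops(fingers, random.choice(nodes), random.randrange(2 ** M), M)
              for _ in range(2000)]
    print(n_nodes, "nodes: average hops", sum(trials) / len(trials), "max", max(trials))
```

On the six-node example ring, the lookup for id 17 from node 3 takes two hops (3 to 11, then 11 to 15). The printed averages for larger rings can be compared against log n as the number of nodes grows.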
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For that, I will again take you back to my big board. So, where we will draw a big circle. So, I 
drew a circle as big as I can. So, here is a point. Let us say I am starting at n, and this is where
id is. We expect that our nodes are uniformly distributed across the ring. It is a massive ring,
and I am using consistent hashing with a very random and uniform hashing function.


So, I can say with good certainty that, overall, my nodes are uniformly distributed. Now, let us
say that between n and id, I expect k nodes to be there. Then I find the closest preceding finger,
and let us say the closest preceding finger is at this point, and this lies within the i'th finger.
The key point is that, after this, I will start my search from this new point. Let us call it n'.
How many nodes are expected to be between n' and id?

How is this related to k? That is the most important question. The distance between n and n', if I
measure it just in the id space and not in terms of the number of real nodes, will be at least
$2^{i-1}$, given the fact that n' is within the i'th finger. Furthermore, id is also in the i'th
finger; it is not in the (i+1)'th.

So, here, there is a little bit of a complication. But let us consider the simple case first. Let
us say that the (i+1)'th finger starts from here. Then you can clearly see that this distance is
more than this distance. So, I can say that the distance from n to n' is more than the distance
from n' to id. This is just the key distance, that is, the distance in the id space.
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So, you can clearly see because if that was not the case, then the closest preceding finger would 
not have been the i'th finger, it would have been the (i+1)'th finger. Of course, there is a little
bit of a complication over here, which I will discuss later. But here again, I am considering the
same simple case where we find id within the i'th finger.

If that is the case, the distance from n to the start of the next finger, measured again in the
key space, is $2^{i}$. So, the main idea is that the distance from n to n' is $\geq 2^{i-1}$, and the
distance from n' to the beginning of the next finger, and hence to id, assuming that id is before
that, is bound to be $< 2^{i-1}$.


So, I can write it over here, assuming that things are in control; I will discuss when they get
out of control later. In this simple case, if with the i'th finger I was considering a span of
$2^{i}$ entries starting from n, then when I start from n', I will be considering a span of half
that. Because what was the maximum distance from n to id? It was $2^{i-1} + 2^{i-1}$, which is $2^{i}$.

Now, when I start from n', I will be considering a smaller span in the key space, which is
$2^{i-1}$, or in other words $2^{i}/2$. So, with every closest preceding finger operation, the span
in the key space that I am looking at drops by a factor of 2. But again, this is the key space.


So, now there is a little bit of a problem. It is very well possible that id does not lie within
the i'th finger. Let us say that this is where the (i+1)'th finger begins, and my id is over here,
beyond that point. Why is id over here? Because it is possible that finger[i+1].node lies over
here, beyond id. Then the logic above will not hold. But we also realize that there is no node in
this region.

Had there been a node in this region, finger[i+1].node would not have been here at the top; it
would have been between the start of finger i+1 and id. That has not happened, so we are in far
better shape. What we can say is that this part is empty. So, as far as we are concerned, if I
were to just count the nodes between n' and id, I can ignore this part, because I am sure that
there is no node over here by the logic that I just gave.
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So, I can still say that if I were to break this an expected number of nodes. So, if let us say that let 
me define a new distance metric, which is the expected number of nodes that lie between two
points. Let us call this $d_n$. Then $d_n$ between n and n' will be some linear function of
$2^{i-1}$.

We can also say that the distance in terms of the number of nodes between n' and id is also
expected to be some linear function of some number x. This x will not really be the distance
between n' and id in the key space, because we know that a large part of it is devoid of nodes. It
will actually be the distance from n' to the start of the next finger, which is $< 2^{i-1}$.


So, using the same logic, regardless of where id is placed, we can measure the distance in terms
of the number of nodes, and the number of nodes in a given region is proportional to the size of
the region in the key space. Using that logic, we can say that if there were k nodes between n and
id, then between n' and id we expect that there will be $\leq k/2$ nodes.


And the reason that I say that is because we are roughly halving the distance if id lies within
the i'th finger. Even if it lies outside the i'th finger, there are no nodes from the end of the
i'th finger till wherever id is. So, as far as we are concerned, the unknown region is still
limited to the part between n' and the start of the next finger.

This is smaller than the other region, so we can say that the expected number of nodes in it is
$< k/2$. So, with every iteration of closest preceding finger, we are reducing the search space
of nodes by a factor of 2. This immediately tells us that, if we just keep on dividing by 2,
ultimately $\log_2(N)$ closest preceding finger searches have to be done. And this is the number
of messages we will have to send over the network till we ultimately find the predecessor and
consequently the successor.
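The halving argument can also be checked empirically with a small simulation. The sketch below is not the experiment from the paper: it assumes a 16-bit id space, 256 nodes placed uniformly at random, and a global sorted list standing in for the finger tables that real nodes would store locally. It counts one hop per closest preceding finger step.

```python
import math
import random
from bisect import bisect_left

def successor(ring, ident):
    """First node at or after ident, walking clockwise on a sorted ring."""
    return ring[bisect_left(ring, ident) % len(ring)]

def between(x, a, b, size, right_closed=False):
    """True if x is in (a, b), or in (a, b] when right_closed, going clockwise."""
    span = (b - a) % size or size      # (a, a) is taken to mean the whole ring
    dist = (x - a) % size
    return 0 < dist < span or (right_closed and dist == span)

def lookup_hops(ring, m, n, ident):
    """Count closest-preceding-finger hops from node n to the predecessor of ident."""
    size = 2 ** m
    hops = 0
    while not between(ident, n, successor(ring, (n + 1) % size), size, right_closed=True):
        for i in range(m, 0, -1):                      # try the farthest finger first
            finger = successor(ring, (n + 2 ** (i - 1)) % size)
            if between(finger, n, ident, size):
                n = finger
                break
        hops += 1                                      # one message per hop
    return hops

M, NUM_NODES, TRIALS = 16, 256, 500                    # illustrative sizes
rng = random.Random(1)
ring = sorted(rng.sample(range(2 ** M), NUM_NODES))
hops = [lookup_hops(ring, M, rng.choice(ring), rng.randrange(2 ** M))
        for _ in range(TRIALS)]
avg_hops, max_hops = sum(hops) / len(hops), max(hops)
print(f"average hops = {avg_hops:.2f}, max hops = {max_hops}, "
      f"log2(N) = {math.log2(NUM_NODES):.0f}")
```

Since every hop at least halves the remaining id distance to the predecessor of the target, the hop count can never exceed the number of id bits, and the average stays well below $\log_2(N)$.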
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Design of Chord Algorithm to find the Successor 


| O(log(n)) Routing Complexity 


So, coming back to our presentation over here, the O(log(N)) logic is clear to us. It is not given
that clearly in the paper, but if you consider all the sub-cases and corner cases that I
discussed, it is easy to see that it will be order log N, and we also see this coming out
experimentally.
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_ Node Arrival 


Each node maintains a predecessor pointer 
@ Initialize the predecessor and the fingers of the new node. 
@ Update the predecessor and fingers of other nodes 
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Now, node arrival and stabilization, so, each node will maintain a predecessor pointer. It will 
initialize the predecessor and the fingers of the new node. It will update the predecessor and fingers 
of other nodes and then it will notify the software that the node is ready. So, in most 
implementations of Chord, we have a successor pointer as well as a predecessor pointer. So, we 
know which node is there on both sides, the successive node and the immediately preceding node 


on the ring. 
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So, how does a node arrive? So, let us say that this is the node n and it wants to join the ring. So, 
initially, it will contact one of the nodes that is already a part of the ring. Let us say this
is node n', which we already know is a part of the ring. Then it will join. So, it will ask n' to
initialize its finger table.


So, there are two basic operations. The first is that the finger table of the node that is
joining needs to be initialized. Second, others have to be updated: the finger tables of other
nodes need to be told that some of their entries should now point to me.
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So, the init finger table this is how it works. So, n. init finger table n’. So, let us consider the node 
for the first finger, finger[1].node. We will ask the node n' to find the successor of
finger[1].start. That is pretty easy: we just ask node n' to kindly run the lookup algorithm and
find the successor of my start. And that is done. So, that becomes finger[1].node, which is also
my successor; as I said, every node maintains a pointer to its successor.


So, once the node n' has done this basic service and found the successor for the first finger,
that is also the successor of the current node, and it can be stored in the successor pointer.
Furthermore, the predecessor is actually the successor's old predecessor. So, we can ask the
successor node who its predecessor is, and we can initialize our predecessor pointer with that.
Then we need to do a little bit more work.


So, we also need to record that the successor's predecessor is now the current node n. This
requires a message: we need to tell the successor that, look, I am your predecessor, and we need
to tell the predecessor node that, look, I am your successor, that is, predecessor.successor is
the current node n. So, this is no rocket science. It is similar to inserting a node in a doubly
linked list.
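The pointer updates are the same as a doubly linked list insertion, which the following sketch spells out. The class and function names are hypothetical, and in a real deployment each assignment to another node's pointer would be a network message rather than a local write.

```python
class RingNode:
    """A node that only tracks its two ring neighbours (no fingers in this sketch)."""
    def __init__(self, ident):
        self.id = ident
        self.successor = self
        self.predecessor = self

def splice_in(new, succ):
    """Insert `new` immediately before `succ`, its successor on the ring."""
    pred = succ.predecessor        # the successor's old predecessor becomes ours
    new.successor = succ
    new.predecessor = pred
    succ.predecessor = new         # message: "I am your predecessor"
    pred.successor = new           # message: "I am your successor"

def walk(start):
    """Ids seen when following successor pointers once around the ring."""
    ids, node = [], start
    while True:
        ids.append(node.id)
        node = node.successor
        if node is start:
            return ids

a, b, c = RingNode(3), RingNode(22), RingNode(11)
splice_in(b, a)                    # ring is now 3 -> 22 -> 3
splice_in(c, b)                    # node 11 joins; its successor is 22
print(walk(a))
```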


Then we need to look at the rest of the fingers. So, for i ← 1 to m - 1, here is what we do. We
look at the (i+1)'th finger, that is, finger[i + 1].start. For the current node, whose table is
being initialized, suppose the start of its (i+1)'th finger is between n and
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finger[i]. node. So, what does this mean? That I am node n, this is my finger[i]. node, let us just 


call it maybe finger[i]. node. 


And let us say finger[i + 1].start is over here. Then it is very easy to see that the (i+1)'th
finger node will also be the same node, because if we start walking down the ring clockwise from
here, the first node we encounter is finger[i].node, which is therefore its successor. So, we can
say that finger[i + 1].node is finger[i].node. This should be obvious. If it is not, it means
that the way these fingers and finger nodes are created is not clear to you.


So, if you are not able to understand the logic behind line number 4 the way that I explained it,
let me draw it again to make it slightly clearer. This is finger[i].node. As I said, if this is
not clear to you, kindly do not proceed; at this point, it should be absolutely clear to you why
I am doing what I am doing. So, I am saying that this is finger[i + 1].start.

From here, I just traverse the ring clockwise. I should arrive at finger[i].node, and that should
be the first node I meet, so it is my successor. Because when I walked from finger[i].start, this
was the first node that I saw. So, if this case holds, this will also be the first node that I see
from finger[i + 1].start. But as I said, it is very important to understand this line before you
proceed.


If this case does not hold, then we will again have to request node n' to find the successor of
finger[i + 1].start for us, and then we use the result to initialize finger[i + 1].node. We just
keep on doing this for the remaining (m - 1) fingers. This will initialize the finger table
completely.
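The shortcut for reusing finger[i].node can be sketched as follows. This is an illustration under assumptions, not the paper's code: a sorted list of existing node ids plays the role of node n' (each call to `successor` stands for one remote lookup), the id space has 12 bits, and the ring contents are random.

```python
import random
from bisect import bisect_left

def successor(ring, ident):
    """Stand-in for asking n' to find the successor of ident on the existing ring."""
    return ring[bisect_left(ring, ident) % len(ring)]

def in_left_closed(x, a, b, size):
    """True if x lies in [a, b) when walking clockwise from a."""
    return (x - a) % size < (b - a) % size

def init_finger_table(n, ring, m):
    """Return (starts, nodes, remote_lookups) for a joining node n."""
    size = 2 ** m
    starts = [(n + 2 ** (i - 1)) % size for i in range(1, m + 1)]
    nodes = [successor(ring, starts[0])]        # finger[1].node, also n's successor
    remote_lookups = 1
    for i in range(1, m):                       # fills finger[i + 1] from finger[i]
        if in_left_closed(starts[i], n, nodes[i - 1], size):
            nodes.append(nodes[i - 1])          # same node, no message needed
        else:
            nodes.append(successor(ring, starts[i]))
            remote_lookups += 1
    return starts, nodes, remote_lookups

M = 12                                          # assumed number of id bits
ring = sorted(random.Random(7).sample(range(2 ** M), 50))
n = next(x for x in range(2 ** M) if x not in ring)   # an unused id for the new node
starts, nodes, remote_lookups = init_finger_table(n, ring, M)
print(f"node {n}: {remote_lookups} remote lookups for {M} fingers")
```

The check saves messages because the first few fingers of a node usually fall in the gap before its successor, so they all share the same finger node.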
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So, now once my own finger table has been fully initialized, what I need to do is I need to update 
the finger tables of others. So, I take a look at all the m fingers. For the i'th finger, we find
the predecessor of $n - 2^{i-1}$. That is, if n is my current id, I subtract $2^{i-1}$; for a node
at or before that point, I may lie in its i'th finger. So, from this point, I find the predecessor,
which is maybe this node.


So, for this node, I am possibly going to be the i'th finger node. I go to this node pred and
send it a message that pretty much says: look, I am now your i'th finger's node, so kindly update
your finger table and record node n as the node for finger i. To do this, I call the update finger
table function, which you see here at the bottom. I do this for all my fingers.


And each time I will call the update finger table function. When I do that, here is a very
important check that I am going to make. For its i'th finger, node pred already has a finger node,
finger[i].node. Let us say the current node n lies somewhere here, between pred and
finger[i].node. If it does not lie there, I am not interested.


But let us say it lies somewhere there. Then the current node n will be set as the i'th finger
node for pred. That is point number 1. Furthermore, there might be a need to cascade this
information. So, pred will look at its own predecessor and send it a message that, look, I have
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gotten information that node n has been added, node n sent me a message, I found the message to 


have merit, so I added node n, which was lying between me and my previous finger[i].node.

So, I added it as the finger node for my i'th finger; it might be a finger node for you as well.
So, why do you not also kindly check, and if that is the case, update yourself. It will keep on
cascading this to its predecessors until all of them are updated. This procedure is done for all
the fingers. Mind you, we only walk in one direction, which is anti-clockwise, because that is
the way that we created our finger table. Once the rest of the nodes have updated their finger
tables, the process of node addition is done.
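To see which entries the update step has to reach, here is a brute-force check with global knowledge of the ring. It is not the message-passing cascade itself; it only lists the (node, finger index) pairs whose finger node must become the new node. The ring {3, 11, 22}, the 5-bit id space and the joining id 17 are assumptions chosen for illustration.

```python
from bisect import bisect_left

def successor(ring, ident):
    """First node at or after ident, walking clockwise on a sorted ring."""
    return ring[bisect_left(ring, ident) % len(ring)]

def in_left_closed(x, a, b, size):
    """True if x lies in [a, b) clockwise; the interval is empty when a == b."""
    return (x - a) % size < (b - a) % size

def entries_to_update(ring, n, m):
    """(p, i) pairs for which the new node n falls in [finger[i].start, finger[i].node)."""
    size = 2 ** m
    hits = []
    for p in ring:
        for i in range(1, m + 1):
            start = (p + 2 ** (i - 1)) % size
            old_node = successor(ring, start)
            if in_left_closed(n, start, old_node, size):
                hits.append((p, i))
    return hits

def entries_changed_by_recompute(ring, n, m):
    """Same answer obtained the slow way: rebuild every finger with n on the ring."""
    size = 2 ** m
    new_ring = sorted(ring + [n])
    return [(p, i) for p in ring for i in range(1, m + 1)
            if successor(new_ring, (p + 2 ** (i - 1)) % size)
            != successor(ring, (p + 2 ** (i - 1)) % size)]

RING, NEW_NODE, M = [3, 11, 22], 17, 5          # illustrative values
print(entries_to_update(RING, NEW_NODE, M))
```

In this toy ring only node 11 is affected: its fingers 1 to 3 start at 12, 13 and 15 and used to point to 22, so they must now point to 17. Node 3 is reached for finger 4 (it is the predecessor of 17 - 8 = 9), but 17 does not lie before its current finger node 11, so the check fails and nothing changes there.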


(Refer Slide Time: 53:34) 


Design of Chord 
Node Arrival and Stabilization 


| Stabilization of the Network (run periodically) 


n.stabilize() begin
    x ← successor.predecessor
    if x ∈ (n, successor) then
        successor ← x
    end
    successor.notify(n)
end

n.notify(n') begin
    if (predecessor is null) OR (n' ∈ (predecessor, n)) then
        predecessor ← n'
    end
end


Smruti R. Sarangi Chord 


So, then we will do a little bit of node stabilization. Stabilization is required particularly
if you have race conditions and so on. What I do is I look at my successor's predecessor. It can
be the previous predecessor or it can be me.

So, this is me, this is my successor, and this node x could be my successor's predecessor. If
that is the case and x is between n and the successor, then the successor will be made x. This is
one way of periodically fixing things. I check: am I my successor's predecessor? If I am not,
then is there a new node x that lies between me and my successor?
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If it lies and still the network is in a position of getting updated because as I said, it is a race 
condition. If that is the case, I will simply set x to be my successor and I am done. Furthermore,
I will notify the new successor that, look, I am your predecessor. So, this notify function will
be called.


And in the notify function, if there is no predecessor, or the node n' that is claiming to be the
predecessor lies between my previous predecessor and me, then I set predecessor to n'. Both of
these are what are called stabilization operations. They help to fix any temporary
inconsistencies in the network.
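A direct Python transcription of the stabilize and notify pseudocode helps to see how a half-joined node gets stitched into the ring. This is a sketch: the class name, the 64-id ring and the node ids 10, 25 and 40 are assumptions, and method calls on other nodes stand in for network messages.

```python
class ChordNode:
    """Only the successor/predecessor pointers are modelled; fingers are left out."""
    def __init__(self, ident, size=64):
        self.id = ident
        self.size = size
        self.successor = self
        self.predecessor = None

    def _between(self, x, a, b):
        """True if id x lies strictly inside (a, b) going clockwise."""
        span = (b - a) % self.size or self.size
        return 0 < (x - a) % self.size < span

    def stabilize(self):
        x = self.successor.predecessor
        if x is not None and self._between(x.id, self.id, self.successor.id):
            self.successor = x                 # a newer node sits between us
        self.successor.notify(self)

    def notify(self, other):
        if self.predecessor is None or self._between(other.id, self.predecessor.id, self.id):
            self.predecessor = other

# A consistent two-node ring: 10 <-> 40
a, b = ChordNode(10), ChordNode(40)
a.successor, a.predecessor = b, b
b.successor, b.predecessor = a, a

# Node 25 joins knowing only its successor; nobody points to it yet.
c = ChordNode(25)
c.successor = b

for node in (c, a, b):                         # one round of periodic stabilization
    node.stabilize()
print(a.successor.id, c.successor.id, b.successor.id)
```

After one round, node 10 has learned about node 25 through node 40's predecessor pointer, and all successor and predecessor pointers are consistent again.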
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So, then also, we can fix the fingers, so let us say that some error has cropped up somewhere. And 
because of that, there is a little bit of inconsistency in the bookkeeping information that is
there all over the network. What I can do is fix fingers, which means that I pick a finger index
i at random (I should have a semicolon over here on the slide).

Then finger[i].node, the node of the i'th finger, is set to the successor of finger[i].start. This
should hold anyway, but as I said, because of some temporary inconsistency that may crop up, it
may not be holding. This is why there is a need to periodically run this procedure and keep on
fixing the network.
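The periodic repair of fingers can be sketched in the same style. The assumptions here are a 5-bit id space, a ring {3, 11, 17, 22} in which node 17 has just joined, and a stale finger table at node 11 that was computed before 17 arrived; `successor` again uses global knowledge in place of a real lookup.

```python
import random
from bisect import bisect_left

M = 5
RING = [3, 11, 17, 22]             # sorted ring after node 17 joined (illustrative ids)

def successor(ident):
    """Stand-in for a find-successor lookup on the current ring."""
    return RING[bisect_left(RING, ident) % len(RING)]

def fix_fingers(n, finger_nodes, rng):
    """Refresh one randomly chosen finger of node n."""
    i = rng.randrange(1, M + 1)                    # random finger index in 1..M
    start = (n + 2 ** (i - 1)) % 2 ** M
    finger_nodes[i - 1] = successor(start)

fingers_of_11 = [22, 22, 22, 22, 3]                # stale: still ignores node 17
rng = random.Random(0)
for _ in range(200):                               # 200 periodic runs
    fix_fingers(11, fingers_of_11, rng)
print(fingers_of_11)
```

Each run repairs at most one entry, so a stale entry survives only until its index happens to be drawn; after enough periodic runs the whole table agrees with the current ring.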
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So, now we will discuss a thing or two about the evaluation set up. So, the network here consisted 
of 10,000 nodes, the number of keys varied from 100,000 to a million, and the experiment was
repeated 20 times; you can also see error bars in the figure. They did not actually do it in a
real data center or on a real distributed system. This experiment was done in a simulated
environment.
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So, what we see is that the number of keys per node decreases with the number of virtual nodes 
and also why did we add virtual nodes, well we added virtual nodes for better homogenization and 


to ensure that no node becomes a hotspot. So, for one virtual node we could have up to 500 keys 
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per node with a mean of 100. And for 10 virtual nodes, we can have roughly 50 to 200 keys per 


node. 


So, what you can clearly see over here is that if you have fewer virtual nodes per physical node,
then the lack of uniformity is large: from a mean of 100, the load goes up to 500. But if I have
10 virtual nodes per physical node, the way that we have been arguing, the uniformity is greater,
or let us say the load is more evenly balanced across the physical nodes.


So, go back to the slide where we discussed virtual nodes. Here we have roughly 50 to 200 keys per
node. Kindly hang on to this concept of virtual nodes; it is very important. It will be used in
the third part of the chapter when we discuss commercial networks, because they also use virtual
nodes for exactly the same reason.


So, this is used as a load balancing trick because, even if we are using the best hashing
algorithm, it will happen in real life that a lot of keys are mapped to some nodes and very few
keys are mapped to others. So, you will see a high spread. If you want to reduce that, have more
virtual nodes per physical node; that will solve this problem to a large extent, as you are
seeing over here.
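The effect of virtual nodes on the spread can be reproduced with a small simulation. This is not the setup of the paper's experiment: the sizes (200 physical nodes, 20,000 keys), the label format and the use of SHA-1 from Python's `hashlib` as the consistent hash are all assumptions made to keep the sketch short.

```python
import hashlib
from bisect import bisect_left
from statistics import pstdev

def ring_position(label):
    """Map a string to a point on a 64-bit ring using SHA-1."""
    return int.from_bytes(hashlib.sha1(label.encode()).digest()[:8], "big")

def keys_per_node(num_nodes, vnodes, num_keys):
    """Number of keys owned by each physical node under consistent hashing."""
    points = sorted((ring_position(f"node{p}-v{v}"), p)
                    for p in range(num_nodes) for v in range(vnodes))
    positions = [pos for pos, _ in points]
    load = [0] * num_nodes
    for k in range(num_keys):
        # A key is owned by the first virtual node at or after its position.
        idx = bisect_left(positions, ring_position(f"key{k}")) % len(points)
        load[points[idx][1]] += 1
    return load

load_1 = keys_per_node(200, 1, 20_000)      # one virtual node per physical node
load_10 = keys_per_node(200, 10, 20_000)    # ten virtual nodes per physical node
for name, load in (("1 vnode", load_1), ("10 vnodes", load_10)):
    print(f"{name}: min={min(load)} max={max(load)} stdev={pstdev(load):.1f}")
```

Both runs have the same mean of 100 keys per node, but with ten virtual nodes each physical node owns ten scattered arcs instead of one, so the long and short arcs average out and the standard deviation drops.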
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So, the path length in Chord grows with a number of nodes. So, if you see the paper it will show 
that it is roughly normally distributed about the mean, with a mean of about six nodes. Mind you,
for a similar network, the path length in Pastry was much lower; the path length in Chord is
slightly more.

If you look at the $\pm 3\sigma$ range about the mean, it varies from 1 to 11, and the
corresponding variance in Pastry was slightly lower. But as you can see, the path length
increases in a nice log N kind of fashion. It is around 2 for a 10-node network, 3 for 100 nodes,
4.3 for 1,000 and 6.2 for 10,000. So, in that sense, it is scalable, even though the overheads
are slightly more than Pastry.
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So, there are other DHT systems it is just not Pastry and Chord. So, for example, you have a 
Tapestry that uses a 160-bit block id with octal digits. So, the routing is like Pastry, it is a digit- 
based hypercube. So, it does not have a leaf set or neighborhood table. And also, it is not necessary 
that all designs will have a ring-based overlay like Chord and Pastry. So, in this case it is a 


hypercube, so it has a different kind of an overlay. 
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So, Kademlia is the basis of BitTorrent. So, each node has a 128-bit id where each digit contains 
only 1 bit. So, Kademlia will be the next lecture in this lecture series. And we find the closest node
to a key. So, there is some amount of reliability in the network in the sense values are stored at 
several nodes and nodes can also cache the values of popular keys. So, that is also doable. So, we 


will discuss BitTorrent next. 


And one of the most influential proposals in this area is CAN, the Content Addressable Network.
We have discussed this earlier also; it uses a d-dimensional multi-torus as the overlay network.
Recall that a torus is basically a mesh in which the opposite edges are connected by wrap-around
links. So, we have long connections of this type. This is of course 2-dimensional; we can have
the same thing in 3D, and we can have the same thing in n dimensions.

It uses standard routing algorithms for tori (tori is the plural of torus) and a virtual
coordinate zone. Whenever a node arrives, we split a zone, and when a node departs, we merge
zones. So, essentially, we break the coordinate space into zones, and that is how we do the
routing. The CAN paper is available online, so I would request viewers of this video to take a
look at it as well.


So, these were the most popular DHTs. We started with Pastry and Chord, and we will have a short
15-minute video after this to discuss BitTorrent and Kademlia. That will heavily build on
whatever has been taught in the last two lectures on Pastry and Chord. So, of course, Tapestry and
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CAN and Kademlia are all there and there are some new papers. So, I just maybe write it down, 


there is something called FAWN, which is for extremely small nodes based on USB sticks.


But BitTorrent is clearly the most popular as of today when it comes to openly available
networks. Almost all major companies like LinkedIn, Amazon, Facebook, etcetera, have their
internal DHTs that store data. Of course, instead of log N time access, they have one-hop
access; we will discuss them in the third part of this course.
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So, this was the Chord paper. And so, the next lecture will be on BitTorrent, but it will be very 
brief. 
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Welcome to the ultra-short lecture on BitTorrent. So, BitTorrent is one of the most popular file 
sharing programs, that is, peer-to-peer network-based file sharing programs, as of 2021. It is
based on the familiar technology of DHTs. This is an ultra-short lecture, but before you go
forward, I would request all of you to take a look at the videos for Pastry and Chord that are a
part of this lecture series, because without understanding Pastry and Chord, this lecture will
not be comprehensible.


So, first, take a look at that. And we will only outline few of the basic points, salient features of 


BitTorrent, the rest will be fairly clear to somebody who understands distributed hash tables. 
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Overview 


Protocol 


@ Made to distribute large files.
@ First create a Torrent descriptor file
    @ Details of the file.
    @ A cryptographic hash of the file's contents
    @ Stored and distributed to search engines.
@ The user joins a swarm of hosts.
@ Each host is a simultaneous downloader and uploader.
@ IDEA: Break a large file into multiple segments (256 KB)
    @ Distribute the pieces to peers.
    @ The peers can subsequently re-distribute the pieces.
    @ A BitTorrent client can simultaneously download the different pieces from different hosts.
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So, overview, so as compared to Napster and Nutella that were typically for smaller files like mp3 
files and music files, BitTorrent was made to serve large files such as large video files; the
main aim was to distribute large video files. Because of that, it is necessary to re-architect
our system. The user who is sharing the file first creates what is called a torrent descriptor
file.


The torrent descriptor file has the details of the file and a cryptographic hash of the file's
contents. The cryptographic hash is required for integrity, because since we are talking of a
large file, and
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we will discuss how it is actually served, it is possible that some bytes may develop a fault. So, 
because of that, a cryptographic hash is required. In BitTorrent, SHA-1 hashes of the individual
pieces are typically used for this purpose. The torrent file is then stored and distributed via
search engines or via a peer-to-peer system.


So, the user joins a swarm of hosts, where each host is simultaneously a downloader and an
uploader. We have seen the same format in other P2P systems as well, in Napster and in Gnutella,
where the user shares a directory that has songs, videos and so on. Since we are talking about
large files such as videos, we break a large file into multiple small segments; each segment is
256 KB.


And these are distributed to peers. So, these pieces of files are distributed to peers. So, this allows 
the client, the BitTorrent client, to actually download all of these segments in parallel. So, this 
increases the bandwidth and also reduces the time needed to get a file. And furthermore, it 
increases the robustness of the system. So, the peers can themselves redistribute the pieces. So, 


this will further add to the robustness. 


And for every file, we number the segments 1, 2, 3, 4, and so on, and we know how many segments
there are. These segments will then come from different parts of the network. So, the BitTorrent
client, which is a piece of software that every user needs to install, can simultaneously
download the different pieces from different hosts.
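The piece mechanism can be sketched in a few lines. This is an illustration, not a BitTorrent client: the file is dummy data held in memory, peers are not modelled, the helper names are hypothetical, and SHA-1 from Python's `hashlib` is used for the per-piece hashes (any cryptographic hash would illustrate the idea equally well).

```python
import hashlib

PIECE_SIZE = 256 * 1024            # 256 KB segments, as on the slide

def split_into_pieces(data, piece_size=PIECE_SIZE):
    """Cut the file contents into fixed-size pieces; the last one may be shorter."""
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]

def piece_hashes(pieces):
    """Per-piece digests that the sharer would publish in the descriptor file."""
    return [hashlib.sha1(piece).digest() for piece in pieces]

def verify_piece(index, piece, hashes):
    """Integrity check run by the downloader on every piece it receives."""
    return hashlib.sha1(piece).digest() == hashes[index]

data = bytes(range(256)) * 4096 + b"tail"      # 1 MiB of dummy content plus 4 bytes
pieces = split_into_pieces(data)
hashes = piece_hashes(pieces)

# Pieces may arrive in any order and from different hosts.
received = {}
for index in [3, 0, 4, 2, 1]:
    if verify_piece(index, pieces[index], hashes):
        received[index] = pieces[index]
rebuilt = b"".join(received[i] for i in range(len(hashes)))

# A piece with a flipped byte fails the check and would be fetched again.
corrupted = bytearray(pieces[1])
corrupted[0] ^= 0xFF
print(len(pieces), rebuilt == data, verify_piece(1, bytes(corrupted), hashes))
```

Because each piece carries its own hash, a faulty piece can be detected and re-requested from another host without discarding the rest of the download.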
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So, the key elements in BitTorrent are like this one is a torrent file, which contains the metadata, 
metadata means a description of the file that will be used for searching, and the hash. Then we
have a specialized entity called a tracker, which is a server. This used to be pretty popular in
the early days of BitTorrent. The tracker coordinated the entire process of downloading a file.


This means that the client would connect to the tracker server, which would have a list of peers
that contain the different segments. The client would then connect to the peers to get the
different pieces. But now the tracker has largely gone away, mainly because of legal issues: you
do not want to have one server that has the list of the entire network.


So, the tracker has been replaced by a DHT. The DHT will help you locate all the peers that
contain a given piece, using the same DHT mechanisms that we have studied in Pastry and Chord.
The DHT that is used is called the mainline DHT, which basically uses the Kademlia protocol.
Recall that in the last few slides of the Chord lecture, we did discuss the Kademlia protocol to
a certain extent.
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So, downloading and sharing files, well users need to use regular search mechanisms to find 
torrents of interest. Similarly, if a server has a new file, it will host it and distribute the torrent file. 


So, it is known as a seeder. So, once the client finds a torrent file, it will connect to the tracker, or 
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it will use the mainline DHT and will download the pieces in a random order. So, there is no fixed 


order. 


So, you can download piece three, piece one, piece two, piece four in any order. So, there can be 
different strategies. So, we can prioritize traffic for those nodes that have sent a lot of data on the 
network. See, if let us say that I have been a very active uploader. In a sense, I have been very 
actively supplying my files, I should get some priority while downloading. And also, tit for tat 
relationships in the sense of I gave you something then you will also give me something back with 


a high priority. 
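One simple way to express the tit-for-tat idea is to rank peers by how much they have given us and serve the top few first. The sketch below is only that ranking step, with hypothetical peer names and byte counts; it is not the choking algorithm of any particular client.

```python
def choose_peers_to_serve(received_from, slots):
    """Pick the peers that sent us the most bytes; ties are broken by name."""
    ranked = sorted(received_from, key=lambda peer: (-received_from[peer], peer))
    return ranked[:slots]

# Bytes each peer has uploaded to us so far (made-up numbers).
received = {"peerA": 5_000_000, "peerB": 0, "peerC": 12_000_000, "peerD": 800_000}
served = choose_peers_to_serve(received, slots=2)
print(served)
```

Here peerC and peerA get the two upload slots, while peerB, which has contributed nothing, has to wait.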


Furthermore, I can reserve some bandwidth for myself and leave some bandwidth for others. The
main problem that actually happened with BitTorrent, and again we go back to college students, is
that they were sharing a directory that had these files and their associated torrent files.


So, all day others were downloading, unbeknownst to the sharer, and that was eating up a large
part of their bandwidth. When they wanted to download something themselves, they did not have
enough bandwidth. So, modern clients are configurable: some bandwidth can be reserved for
oneself, and some for others.
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Security and privacy, so as such BitTorrent does not provide an anonymity or security. And 
furthermore, the onus is on the site that indexes the torrents like a tracker site. Even without that 
everybody involved in the hosting and propagation of copyrighted or illegal material, in a sense is 
legally culpable. So, of course, to what extent it is enforced depends on the laws of the specific 


country. 


But there are two broad approaches. Either we use a tracker server that provides a directory, or
we use a DHT. The problem with the DHT is that it requires multiple hops, but the legal liability
is much lower, and there is more robustness as well. Locating peers is also not a bottleneck:
given the fact that it will take a proportionally much longer time to download the segments
themselves, locating a node that has the torrent will not take that much time. Plus, these are
not strictly real-time tasks, so we do not need to worry about the latency to that extent.
BitTorrent as of today is banned in a lot of places, particularly university campuses, regardless
of whether you are using a tracker or a DHT. So, that needs to be understood.
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In spite of that BitTorrent is extremely popular. So, the mainline DHT, which BitTorrent uses is 


the largest DHT in the world, so it does have somewhere between 10 million to 25 million 
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connected computers. So, BitTorrent is clearly the largest file sharing system in the world at the 


moment. 


All the current versions of the BitTorrent clients are compatible with the mainline DHT, but they
can connect to trackers as well. Furthermore, BitTorrent has expanded, and it uses other kinds of
protocols. For example, it uses a gossip-based protocol to synchronize and implement BitTorrent
directories among the peer nodes. This protocol is called Tribler.


So, this is again a gossip-based scheme, where I maintain a directory of file names and servers,
and we periodically exchange updates. Go back to the lecture on epidemic and gossip-based
algorithms: we use anti-entropy to regularly extend the list of torrents. Furthermore, since
there are lots and lots of torrents, the BitTorrent software also gradually learns about the
user's preferences, filters the torrents, and essentially stores those torrents that are more
aligned to the user's viewing preferences.


So, this, in a nutshell, was BitTorrent. We did not discuss much about the Kademlia protocol or the
mainline DHT, but my feeling was that whatever we discussed at the end of Chord is enough to
give an introduction to Kademlia, and the protocol of course can be read up on the web. But the
main idea with BitTorrent should be clear: it is clearly the largest DHT that runs in the world,
and it uses other methods also, which include gossip-based algorithms and trackers.
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So, the BitTorrent Wikipedia article can give a quick introduction. If you want to know more about 
BitTorrent, you can always read the paper by Izal et al., which talks about five months in a
torrent's lifetime, so it will tell you everything about it. So, this lecture pretty much finishes our
discussion on DHTs. We have discussed quite a few: we have discussed Pastry, we have discussed
Chord, we have discussed Tapestry and Kademlia with one slide each, and now we have discussed a
system made on a DHT, the mainline DHT, which is the BitTorrent system.


So, subsequently we will move to the second half of the course. The first part was essentially
DHTs, epidemic and gossip-based algorithms, and so on. The second half of the course will
basically look at distributed algorithms. That is important because DHTs are only one kind of
distributed algorithm; there are many more types, and all of them are required.


And finally, we will use the results of parts 1 and 2 to create actual systems. We did see one actual
system, BitTorrent is an actual system, but we will create bigger systems that use the results taught
in parts 1 and 2 of this course.
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Welcome to the chapter on synchronization. So, in this chapter we will talk about clocks, so 
we will talk about logical and physical clocks, because DHTs are not the only things that can 
be built using distributed systems and distributed algorithms, so for all other kinds of 
algorithms which we shall see in the second part of this course, we will see that there is a need 


for different kinds of clocking mechanisms to maintain time. 


So, we can either have physical time or we can have what is called logical time. In a distributed 
algorithm, logical time is clearly more important but physical time is also used, so we will see 


when what is used? 
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So, we will discuss what is called a synchronous system? This relies on physical clocks or it 
relies on systems where there is some notion of timing, and then we will discuss asynchronous 
systems that do not rely on any notion of timing. So, it basically means that the delay of a 


message can be very large. 


So, whether it is finite or infinite that again is a theoretical issue, but essentially you cannot 
assume anything about the time it takes to process a message or send a message, or the message 


transmission time that cannot be assumed. 
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So, we will discuss physical clocks first, so we will understand the way that you know the 
physical clocks work in our watches and within computers, so inside a computer also there is 
a Clock, so the basic idea is the same. So, the idea is that a quartz crystal is used to generate a 


clock signal. What is quartz? 


Quartz is a piezoelectric material, which generates a voltage when subjected to mechanical
stress; vice versa, when you change the potential across it, mechanical stress develops in it.
So, the basic quartz oscillator that you have over here can be represented with an equivalent
circuit, which is shown over here.


And every quartz oscillator has a natural frequency, a natural frequency of oscillation, which 
is essentially the resonant frequency of this circuit. So, basically what happens is that there is 
a feedback mechanism; via the feedback mechanism every quartz oscillator would oscillate at 


a certain frequency which for it is the natural frequency. 
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So, what happens, as I just showed that there is a quartz oscillator that is part of a cell feedback 
loop, and there is, of course, an amplifier here to amplify the signal to properly embellish the 
signal, so the quartz oscillator here typically oscillates at 32 kilohertz, so that is the base 


frequency of the oscillator. 


Furthermore, this clock can be divided using a standard clock divider circuit to derive lower
frequencies, such as a one-second tick, or it can be multiplied and used to create a much higher
frequency. So essentially this can be thought of as the time base
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and then this time base can be used to construct a clock with a much much higher frequency. 
The clock drift is very little. It is about ±15 seconds per month, which is fine, since most of our
watches do not need to maintain time beyond that extent.


And the clock drift per se is not an issue; it is only about six parts per million (ppm), and nowadays
of course quartz technology has improved by leaps and bounds, so it is getting much better by the day.
But nevertheless, for large distributed systems, a regular quartz clock is not suitable, because
even the 6 ppm error might be on the higher side, so you would want a lower error.
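
As a quick check of the arithmetic above, the following sketch converts a drift of 15 seconds per month into parts per million. The 30-day month and the function name `drift_ppm` are assumptions made only for this illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def drift_ppm(drift_seconds, interval_days=30):
    """Return the drift as parts per million of the elapsed interval."""
    interval_seconds = interval_days * SECONDS_PER_DAY
    return drift_seconds / interval_seconds * 1_000_000


# 15 seconds over an assumed 30-day month comes out a little under 6 ppm.
print(round(drift_ppm(15), 2))
```

So the "six parts per million" figure and the "15 seconds per month" figure are two ways of stating the same drift.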
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So, we use a Caesium-133 atom as an oscillator, so this is very accurate. So, this also uses a 
similar feedback-based circuit as the basic quartz clock. Its accuracy is $10^{-8}$ ppm, which
means it is extremely accurate, and this is what is typically there in what is called an atomic
clock.


(Refer Slide Time: 05:11) 


Physical Clocks
Synchronous Systems


Use of Atomic Clock: GPS


• Each satellite broadcasts its position (x_i, y_i, z_i) and time t_i.
• The time is obtained through an atomic clock.


Smruti R. Sarangi Logical and Physical Clocks 


So, now let us see how the atomic clock is actually used, so we will come to the GPS system.
Let us consider what is called satellite GPS. What we actually use in our mobile phones is often
what is called A-GPS, or assisted GPS, which triangulates based on the positions of multiple
mobile towers; that is not what we are looking at here. We are talking about regular satellite GPS.


So, let us consider a simple version of the problem, where there are four unknowns. One
unknown, of course, is the current time: we do not know the current time very accurately; even
the mobile phone does not know it because it does not have an atomic clock. The other unknowns,
of course, are its x, y, and z coordinates.


So given that there are three unknowns here and one unknown over here, a total of four
unknowns, we would need four equations. The four equations will come when we have four
independent sets of data, and these four sets of data would require four satellites. So, this is
basically what is required for satellite GPS, and this is how time is obtained via an atomic clock.
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So, the basic physics is very simple, it is simply based on the propagation of light and the 
Euclidean geometry. So, of course there are some relativistic effects that also do come into this 
and that is of course required but I am talking of a much simpler formulation. So, if the current 
position is x, y, z relative to some inertial reference frame, let the drift between the receiver
clock and the atomic clocks be d.


So we are assuming that all the atomic clocks on the satellites, there is no drift between them, 
but there is some drift between the clock of the receiver which in this case is the mobile phone, 
and the atomic clocks. So, let us say the time at which the receiver receives the message, let it 


be “tr”, so we can set up a small equation. 


So, the left-hand side of the small equation that we will set up over here is the distance, which as
you can see is the simple Euclidean distance, where xi, yi, and zi are the coordinates of the ith
satellite. On the other side, tr - ti is the time that we are measuring: whenever the satellite
sends a message it affixes its timestamp ti to the message, and when the GPS receiver receives the
message, it looks at the current time tr and subtracts the two, which gives tr - ti.


$$\sqrt{(x - x_i)^2 + (y - y_i)^2 + (z - z_i)^2} = (t_r - t_i + d) \times c$$


And so, this is roughly the time it took for the signal to come from the satellite to the receiver, 
but of course since there is a drift in this, we add that, so this becomes the time. So, the drift in 


this case is an unknown, we do not know that, and so if you look at the previous slide, we were 
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assuming that the time the exact time is an unknown, but this has been changed to make drift 


the unknown. 


Mathematically, they are the same thing just a rearrangement of terms. So, given the fact that 
this is actually the time of propagation, we multiply that with the speed of light. And then after 
multiplying that with the speed of light, what do we get? So we have distance on the right-hand 
side, and distance on the left-hand side, so we have an equation in front of us. This equation 


has four unknowns, so what are the four unknowns? 


The four unknowns over here are x, y, and z, which in this case are the coordinates of the position
of the receiver, and of course d, which is the drift in time. xi, yi, and zi give the position of
the ith satellite, and ti is its time, which has been affixed to the message. And given that all
the satellites have an atomic clock, we are assuming that there is no drift between them.


Now, given that we have four unknowns, we need four equations. So, since we have four 
satellites, each satellite contributes an equation of this type, so four equations that are quickly 


solved. 
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So, What we need to do is, we just set up these four equations and then all that your GPS 
receiver does is that it solves these four equations. There are many techniques on how to solve 


them, I am not getting into that but let us put it this way, it is easy to solve. 
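
To make "it is easy to solve" a little more concrete, here is a minimal sketch of one possible technique, Newton's method, applied to the four equations. This is only an illustration under stated assumptions and not necessarily what a real GPS receiver does: distances are in kilometres, the function names `solve_linear` and `gps_fix` are made up for this example, the starting guess is assumed to be near the Earth's surface, and relativistic effects are ignored as in the simplified formulation above. Each measured value `c * (tr - ti)` is passed in as a "pseudorange", and the unknown drift is handled internally as the distance `b = c * d`.

```python
import math

C_KM_PER_S = 299_792.458  # speed of light in km/s


def solve_linear(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def gps_fix(sats, pseudoranges, guess=(0.0, 0.0, 6370.0, 0.0), iterations=10):
    """Solve sqrt((x-xi)^2 + (y-yi)^2 + (z-zi)^2) = c*(tr - ti) + c*d for x, y, z, d.

    sats: four (xi, yi, zi) positions in km.
    pseudoranges: four values of c*(tr - ti) in km.
    guess: assumed starting point (x, y, z, c*d) in km.
    """
    x, y, z, b = guess
    for _ in range(iterations):
        residuals, jacobian = [], []
        for (xi, yi, zi), rho in zip(sats, pseudoranges):
            r = math.sqrt((x - xi) ** 2 + (y - yi) ** 2 + (z - zi) ** 2)
            residuals.append(r - rho - b)
            jacobian.append([(x - xi) / r, (y - yi) / r, (z - zi) / r, -1.0])
        dx, dy, dz, db = solve_linear(jacobian, [-f for f in residuals])
        x, y, z, b = x + dx, y + dy, z + dz, b + db
    return x, y, z, b / C_KM_PER_S  # convert c*d back to the drift d in seconds
```

Each satellite contributes one row to the system, which is why four satellites are needed for the four unknowns. With made-up satellite positions and a known receiver position, one can generate the four pseudoranges and confirm that `gps_fix` recovers both the position and the drift.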
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And, once it is solved we can also find the drift meaning, we can find the exact time, we can 
compensate for the drift and find the exact time on the receiver, and we can also find its
coordinates x, y, and z. Normally, we do not use the z coordinate, so we can make a fixed
assumption if we know the elevation of a certain region.


Then of course we just have x, y, and the drift as unknowns, so we would need three satellites,
which, incidentally, is what is done most of the time. If we really want the z coordinate, let us say
for an aircraft or something, we would need four satellites. So, this is of course a very exact
mechanism of doing it, but if the z coordinate is not an issue, what we would need to do is
triangulate.


So, then let us say that if this is not an unknown, we would need three satellites or three other 
places that can give us the exact time. So, normally what we have in assisted GPS is that we 
triangulate, so when we triangulate what we actually get is that we find the distance from three 


mobile phone towers and that gives us an estimate of the current location. 


So, this is normally what we refer to as GPS, and this as you can see is much cheaper, much 
cheaper than actually getting the coordinates from the satellite. And since most mobile phones 


are in a region with a signal, this is often easy to do it is easy to achieve. 
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Now, let us consider a slightly different protocol. So, in this protocol what happens is that there 


is no external time base like there is no satellite or there is no mobile phone tower, there are 
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just a set of machines and they want to ensure that between themselves they have a 


synchronized time. 


So, given that there is no external provider, they will have to somehow ensure that all of them
are following the same time base. This time base might not match the real-world clock time,
and that is okay: insofar as running a distributed algorithm is concerned, as long as all of them
use the same time base, it is all right.
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So, this is where we use the network time protocol, you would have heard about this in many 
scenarios; this is normally called NTP. So, what happens is that we have a set of network
time servers that have accurate clocks; let us call them stratum 1. These servers synchronize
themselves with even more accurate clocks, which are mostly atomic clocks and which are
stratum 0. But these atomic clocks are typically not accessible to common users like us, so we
would basically synchronize our time with the stratum 1 servers.


If you go to the windows time setting or Linux’s time setting, it will give you an option to 
synchronize the time with a network time server, and so these network time servers are run by 
numerous agencies, and most universities also run them, major vendors also run them, does not 
matter, they are all in stratum 1, all that my machine needs to do is it needs to connect to a 


network time server and synchronize the time with it. 


Mobile phones also do that; you would have often seen that. Let us say you take a flight and
land up in a region with a different time zone, or let us say your mobile phone is off, you
charge it on the plane, and then when you land, what you would actually see is that the moment
it gets its signal, the time changes.


The time changes mainly because the phone synchronizes its time with the clock of the tower, so this
is a stratum 1 synchronization. Client machines similarly contact the NTP time servers, find the
drift between the clocks, and update their clock. So, this is of course a simple example, but we will
look at slightly more complicated examples where we discuss advanced forms of clock
synchronization.
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So, what happens is that we will discuss the simplest algorithm first it is called the Cristian's 


Algorithm. So, here the client sends a request to the server at its local time t1. The server
receives it at time t2, which is its local time.


The server then sends a reply at its local time t3, and the client receives the reply at t4. Let us
assume that the jitter in the network is 0, which means that it took the same time to send the
message from the client to the server as it took for the reply to come back. If the request and
response take the same amount of time, what do we have? For the request, what we have is
t2 - (t1 + Δ).


So let us say that when the time was t1 at the client, the actual time was (t1 + Δ) with respect
to the server. So, the delta here is the drift, and t2 - (t1 + Δ) is essentially the message
transmission time, so let us call it t_trans. Similarly, when the client received the reply, it
received it at t4, which is the client's local time.


So the server's local time at that point would be (t4 + Δ), where delta is the drift between the
clocks. Then (t4 + Δ) - t3, where t3 is the time at which the server sent the reply relative to its
own clock, is the transmission time again, t_trans. And given the fact that we are assuming that
there is no jitter in the network, which means it takes the same time to send and to receive, we
have an equation over here:


$$t_2 - (t_1 + \Delta) = (t_4 + \Delta) - t_3$$


And the value of $\Delta = \frac{(t_2 - t_1) + (t_3 - t_4)}{2}$, which is just a simple algebraic
reduction from here. What this essentially teaches us is that, if we just consider a client and a
server regardless of any external time base, it is possible for the client to figure out the clock
drift between its clock and the server's clock, and then the client can adjust its local time such
that Δ = 0.


So, in this case, even in isolation, a client and a server can synchronize their clocks using this 
simple algorithm which is called the Cristian’s Algorithm. Of course, the key assumption that 
is being made over here is that the jitter is 0, which means that the message transmission time 
regardless of the sender or receiver as long as the link is the same, remains the same, regardless 
of the direction. So then, at the end what happens the client shifts the clock by delta, and that 


sets their clocks to be the same. 
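
The calculation in Cristian's Algorithm can be sketched in a few lines. This is a minimal illustration under the same zero-jitter assumption; the function names are made up for this example, and the convention is that server time = client time + Δ, as in the discussion above.

```python
def cristian_offset(t1, t2, t3, t4):
    """Estimate the drift Δ (server time = client time + Δ), assuming zero jitter.

    t1: client sends the request (client clock)
    t2: server receives the request (server clock)
    t3: server sends the reply (server clock)
    t4: client receives the reply (client clock)
    """
    return ((t2 - t1) + (t3 - t4)) / 2


def transmission_time(t1, t2, t3, t4):
    """One-way message delay implied by the same zero-jitter assumption."""
    return ((t4 - t1) - (t3 - t2)) / 2


# Made-up scenario: the client clock is 5 units behind the server, the one-way
# delay is 2 units, and the server takes 1 unit to prepare its reply.
t1, t2, t3, t4 = 100, 107, 108, 105
delta = cristian_offset(t1, t2, t3, t4)
print(delta, transmission_time(t1, t2, t3, t4))

# The client shifts its clock by delta, after which the drift is 0.
client_now = t4 + delta
```

If the request and the reply actually take different amounts of time, the estimate is off by half of the difference between the two delays, which is why the zero-jitter assumption matters.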
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Now, we will discuss the Berkeley Algorithm. So, the Berkeley Algorithm is an extension over 
and above Cristian's Algorithm. So, in this case, a master is chosen by some method among
a group of nodes. The algorithm calls it the master, but we can also think of it as a leader. Well,
how will they choose a leader?


We will discuss leader election algorithms maybe one or two lectures later, but let us assume
that there is some way for the nodes to select a leader, which is one distinguished node among
them, and let us not look at failures for the time being. The leader, or master, will use Cristian's
algorithm that we just studied to find the clock drift with each slave. So, what does the leader do?


It basically finds the clock drift with each slave. We call them master and slave over here, not
leader and follower. Then the master computes the mean value of the drift and sends an update to
each slave regarding the amount by which the slave needs to shift its clock.


So, let this mean value be Δm. All the nodes would like to synchronize to a clock that is shifted
by Δm from the master's clock, so the master makes the appropriate calculation and sends each
slave the amount by which it needs to shift its clock; the master shifts its own clock as well.
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All the slaves including the master will shift their clock, so this will ensure that the clocks of 
most slaves and the master are relatively synchronized with each other. Furthermore, given the 
fact that we consider the mean, this will minimize the amount by which each slave needs to 


adjust its clock. So, basically major adjustments are avoided. 


Let me summarize this algorithm in a different manner. If we have the master and a set of slave
nodes, different slave nodes will have different values of the drift. One option could be that we
say that the master's clock is correct and tell each slave: this is your drift, correct your clock
such that you are totally in sync with the master.


The other option is that we say that none of the clocks are correct. Let us instead find the mean
value of the drift, and let us assume that the correct value of the clock is essentially
t_master + Δm, where t_master is the master's current time. Given that we treat this as the
correct value, the master as well as the slaves need to change their clocks, such that all of their
clocks are roughly synchronized to this hypothetical value.


And furthermore, the amounts that they need to actually change their clock is not much because 
the fluctuations across the mean are not expected to be great, are not expected to be substantial. 
So, this is one this is like a higher level algorithm, over the Cristian’s algorithm, that uses the 
Cristian’s algorithm as one of its components to find the drift between each pair, each master 


slave pair. 
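
The averaging step of the Berkeley Algorithm can be sketched as follows. This is a simplified illustration: the drifts are computed directly from made-up clock values instead of being measured with Cristian's algorithm, the master's own drift of 0 is assumed to be included in the mean, and the function name is hypothetical.

```python
def berkeley_adjustments(drifts):
    """Given drifts[i] = clock_i - master_clock (0 for the master itself),
    return the amount by which each node should shift its clock."""
    mean_drift = sum(drifts) / len(drifts)  # this is the mean drift, Δm
    return [mean_drift - d for d in drifts]


clocks = [100.0, 103.0, 98.0, 101.0]        # index 0 is the master
drifts = [c - clocks[0] for c in clocks]    # drift of each node w.r.t. the master
shifts = berkeley_adjustments(drifts)
synced = [c + s for c, s in zip(clocks, shifts)]
print(shifts)
print(synced)
```

After applying the shifts, every clock shows the master's time plus the mean drift, and no node has to move further than its distance from the mean.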
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So, now that we have discussed these physical clocks, it is time to discuss a small algorithm 
with physical clocks or synchronous clocks, we will call it Totally Ordered Multicast. So, what 
does this mean? What this means is that, if there are a set of nodes, and let us say one node is 


sending a message to a subset of the other nodes. 


Let us say node 1 is sending one set of messages and node 2 is sending another set of messages.
Then the order in which the messages are received should be the same across all the nodes. It
should never be the case that message A, from let us say node 1, was received before message B at
one of these nodes, and another node records a different order, which means message B was
received first and message A was received later. That should not happen. So, sending to multiple
nodes is called multicast, and ensuring that all the nodes see the same order of messages is known
as total ordering. So, the algorithm here that we are discussing is totally ordered multicast.
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So, the problem is as follows. So, the problem is that we have a system with multiple nodes, 
they randomly send messages to a subset of other nodes. So, it is important to note that they
are not sending a message to a single node, but they are sending a message to a subset of other
nodes. Furthermore, the network has a non-deterministic delay; however, the delay is bounded
by capital delta (Δ).


So, there is also one more assumption that we are making, which I am not stating because it 
has kind of been flowing from the last few slides, which is that we are assuming that their 
clocks are synchronized. In such a scenario, we want to ensure that all the messages are 


delivered in the same order at all nodes. 


So, as we have just discussed, it should never be the case that node A says that message X was
delivered before message Y, and node B says that message Y was delivered before message X.
That is not totally ordered; but if all of them see the same order of delivery, it becomes totally
ordered.


So, what we do is that we timestamp every message with the local time. Here, of course,
the assumption is that the local clocks are synchronized. The receiver, for a message with
timestamp t, transfers it to the receive queue at time (t + Δ). Why (t + Δ)? Well, the answer is
very simple: we are assuming that the network has a non-deterministic delay that is bounded by Δ.
Alternatively, we can say that the network has a fixed delay but the clocks are not fully
synchronized and have a skew of Δ, so we can play it in different ways. But the important point is
that every message is timestamped: if the sender is sending the message at time t to a multitude
of receivers, it transmits the message with its own timestamp, which is t.
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So, let me show this in a better way. If this is the sender, it is sending the message to a bunch 
of receivers. The message has a timestamp which is the sender's time which is t, so let us for 
the time being assume that all the clocks are synchronized, the only jitter is there in the network. 


Subsequently, at a later point of time, the receiver receives the message. 
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So, if let us say it waits for all messages, as you can see over here that if let us say it receives, 
it waits for all messages till the time t + Δ. So, this will essentially guarantee that all messages
sent before time t have reached it. So, what am I saying? What I am saying is that at time
t + Δ, the message is transferred to the receive queue.


The property that I am claiming is that all messages sent before t have been received; not
only have they been received, they have also been put in the receive queue. Why am I saying
that? Well, consider another message that was sent at time t'. Let us make a simplistic
assumption, which I can relax later if you want: the minimum network transmission delay is 0
and the maximum is Δ.


So, in this case what would happen is that let us say I transmit another message at time t’ and 
this is just before t, that is okay, so after delta time units which is the maximum, this message 
would have been received by all the receivers, and this would automatically guarantee that 
because of this maximum time which is delta, any message sent before it would also have been 


received. 


So, this is why, when we add this message with timestamp t to the receive queue at time t + Δ,
we are sure that all messages sent to the current receiver with a timestamp less than t are
present in the receive queue. This property, that all messages sent before t to the current
receiver have been received and have also been placed in the receive queue, is important, and it
is directly a consequence of two things.


One is clock synchrony, and the other is bounded network delay. So, it is a consequence of 
these two properties that this is happening. So, given that we have this, what we can see is 
subsequently we deliver the messages in the receive queue, based on the order of the 


timestamps. 


So, what we do is we take a look at the receive queue and, given the fact that every message
carries a timestamp, we deliver the messages from the receive queue to the process. This is not a
first-come, first-served queue; it is rather a priority queue, and the priority in this case is the
timestamp, so the messages are delivered to the receiver process in the order of their timestamps.
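
The receiver side of this scheme can be sketched with a priority queue. This is a simplified illustration, not a full protocol: the class and method names are hypothetical, the holding step and the receive queue are merged into a single heap, ties between equal timestamps are broken by sender ID (an assumption the lecture does not spell out), and all arrivals up to time `now` are assumed to have been processed before `deliver(now)` is called.

```python
import heapq


class Receiver:
    """Holds each message until its timestamp + delta has passed, then
    delivers messages in increasing timestamp order."""

    def __init__(self, delta):
        self.delta = delta  # assumed upper bound on the network delay
        self.queue = []     # min-heap ordered by (timestamp, sender)

    def on_arrival(self, timestamp, sender, payload):
        heapq.heappush(self.queue, (timestamp, sender, payload))

    def deliver(self, now):
        """Return every message with timestamp + delta <= now, oldest first."""
        delivered = []
        while self.queue and self.queue[0][0] + self.delta <= now:
            delivered.append(heapq.heappop(self.queue))
        return delivered


# Two receivers see the same two messages arrive in opposite orders.
r1, r2 = Receiver(delta=5), Receiver(delta=5)
r1.on_arrival(10, 1, "A")
r1.on_arrival(12, 2, "B")
r2.on_arrival(12, 2, "B")
r2.on_arrival(10, 1, "A")
order1 = r1.deliver(now=17)
order2 = r2.deliver(now=17)
print(order1 == order2)
```

Even though the arrival orders differ, both receivers deliver A before B, because delivery depends only on the timestamps and not on the arrival order.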
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So, the claim is that this is a totally ordered multicast, so we will see, why? So, the key idea is 
as follows. If you consider the receive queue, and if you just take a look at the timestamps,
what we are saying is that you take the minimum timestamp over here, let this minimum be
t_min, and you deliver this message to the receiving process.


So, let us assume that this is wrong, in the sense that there is one more message which should
be delivered first, and this message has a lower, that is, an earlier timestamp. But that is
not possible, given the fact that we just proved on the previous slide that all the messages with
a lower timestamp are already present in the receive queue. So it will never be the case that
a message with a lower timestamp is not present in the receive queue.
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So, let me write that, all messages with let us call it a smaller timestamp are already present in 
the receive queue. So, given that we have this property, we will essentially be delivering 
messages to the process in an ascending order of timestamps. So, this is clearly globally 


ordered, because we are clearly not missing any message. 


So we are not dropping any message; here we are waiting for the message to come, and not only
for the message but for all of its predecessors to come, because of the Δ property. Given the
fact that we are not missing a message, we are looking at all the relevant messages that should
be considered, and we are delivering them in exactly the same order across nodes, which is
essentially the order of increasing timestamps.


So, we are delivering in exactly the same order, hence given these two properties that no 
message is being missed, and we are delivering in exactly the same order which is an ascending 
order of timestamps. This proves that we have achieved a totally ordered multicast. Now, the 
problem is that this did make assumptions on bounded network delay, or let me call it bounded 


jitter, this did make assumptions about clock synchrony. 


So, these assumptions do not hold in practice for a practical distributed system, the reason being
that we can have reasonably large, indefinite delays in the network, and along with that
we do not have clock synchrony. So, we need to create some notion of logical time and
try to achieve the same thing, maybe not multicast but a slightly simpler problem, which we can
of course extend to multicast, and see how to do it with a logical clock, which does not make
these assumptions, instead of a physical clock.
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So, given that we do not want to have a notion of global time, what we can do is that we can 
think of processes, where every process has its own events over a timeline. Events can either
be internal events, or an event can be a send, which another process receives, so it can be a
send or a receive.


So, let us define a happens-before relationship. If a process issues event a before event b
within the same process, then we can clearly say that a has happened before b. If event a is the
sending of a message by one process and b is its receipt by another process, then
also we can say that a → b, because it is not possible for the receive event b to occur without
the message being sent.


So, these are our two happens-before relationships: one holds within the same process, and the
other holds across processes. Essentially, within the same process events are ordered one after
the other, and for a send-receive pair, the receive needs to happen after the send. Furthermore,
transitivity holds, in the sense that if a → b and b → c, we can say that a → c. So, we are
trying to develop some notion of a logical time which is different from physical time.
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So, now let us look at some definitions, so if it is the case that a has not happened before b, and 
b has not happened before a, then what? This of course can happen in a distributed system. It
cannot happen in a physical system, but in a distributed system it can definitely happen. Let us
say there are two events on two different processes, and those processes do not send any
messages to each other; then it will be the case that a has not happened before b,
and b has not happened before a.


So, a and b are said to be concurrent, and we would use a special symbol for concurrence. If a 
happens before b, then we can say that a causally affects b, in the sense that there is a causal 
dependence: a happens first, a is the cause and b is the effect. So, let us create what is 
called a Lamport clock. Let us assign a number T(a) to each event a, and let us call it the 
Lamport clock; it is an integer, in fact a natural number. 


So, we want it to satisfy a condition, the clock condition: (a → b) ⇒ T(a) < T(b). 
The clock condition can be broken into two sub-conditions. First, if a happens before b, and a 
and b belong to the same process, then T(a) < T(b); we need to ensure that. Second, if a 
represents a send and b is its receipt, then also we need to ensure that the timestamp of the 
send is less than the timestamp of the receive. So, these are the two things we need to ensure, 
c1 and c2. 

So, enforcing the clock condition is easy, it is not hard at all. Every process keeps a clock, 
which is a counter that is initialized to 0. So, let us call process i's clock, or process i's 
counter, Ti; I will write it as ti from now on. Each process increments ti between two 
successive events. This ensures clock condition c1: whenever I have one event and then another 
event after it, if the first has timestamp ti, I just increment it by 1. 


If event a is the sending of a message by process i, then this process embeds its own timestamp 
in the message. Let b be the corresponding receive event at process j. 


So, we need to ensure that the timestamp of b is greater than the timestamp of a, where a is 
the send event and b is the receive event. What we do is compute a max operation between the 
timestamp carried in the message, which is a's timestamp, and process j's existing timestamp. 
Clearly, the final timestamp of b has to be greater than both, its existing timestamp as well 
as the timestamp of a, because it is a new event. So, we compute the maximum of the two and 
then we add 1 because it is a fresh event. 


So, the 1 of course comes because it is a fresh event, but the maximum comes because, number 
one, we do not want to go back in time, so the timestamp at process j will only increase. The 
other thing is that a may have the higher timestamp; so let us say a's timestamp is 10 and 
process j's current timestamp is 5. 

We first need to compute the maximum of 5 and 10, which of course is 10, and then add 1 
because it is a fresh event. So this becomes 11, and 11 becomes the timestamp of event b as 
well as the current time of process j. So, this method is very simple; the way that the two 
clock conditions c1 and c2 are being enforced is extremely simple. 


So, for c1, on any new event of a process we just increment the counter by 1. For c2, anytime 
a is the send event and b is the receive event, we just ensure that the timestamp of the 
receiving process becomes greater than both its current timestamp and the timestamp of the 
sending process: we take the maximum and add one because it is a new event. 


So, this provides only a partial ordering between the events, but it does satisfy both of our 
clock conditions c1 and c2, and this is called a Lamport clock, a simple Lamport clock. 
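
The two rules can be sketched in a few lines of Python. This is only an illustration under the assumptions stated in the comments; the class name LamportClock and its method names are made up for this sketch and do not come from the lecture.

```python
class LamportClock:
    """A scalar logical clock: one counter per process, starting at 0."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # Rule for c1: every new local event increments the counter by 1.
        self.time += 1
        return self.time

    def send(self):
        # A send is itself an event; its timestamp travels with the message.
        return self.tick()

    def receive(self, message_time):
        # Rule for c2: take the maximum of the local time and the timestamp
        # in the message, then add 1 because the receive is a fresh event.
        self.time = max(self.time, message_time) + 1
        return self.time


# The numbers from the lecture: the message carries 10, the receiver is at 5.
sender = LamportClock()
sender.time = 9            # assumed starting value so that the send gets 10
stamp = sender.send()

receiver = LamportClock()
receiver.time = 5
receive_time = receiver.receive(stamp)
```

With these values the send is stamped 10 and the receive becomes max(5, 10) + 1 = 11, as in the example above.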

So, it is called a scalar clock; it is a Lamport scalar clock. What we have seen is that if 
a → b, then the time of a is less than the time of b. What about the converse? If the time of 
a is less than the time of b, does it imply that a happened before b? If this were the case, 
bear in mind, then if a and b are concurrent events, it would imply that their timestamps are 
the same. 


So, a and b being concurrent basically means that a does not happen before b and b does not 
happen before a. Given that both of these hold, under the assumed converse, a not happening 
before b means that the timestamp of b is not greater than the timestamp of a; similarly, 
T(a) is not greater than T(b), and so they must be equal. But we will see an example where 
this does not hold; in the case of a scalar clock, this actually does not hold. 


So, consider events a and b in the same process, and an event c in a different process. What 
you can see is that event a and event c are concurrent, because they are in different processes 
and there is no send-receive interaction happening between them. Given that they are 
concurrent, by this assumption their times should be the same. 


Again, given that b and c are concurrent, by this assumption their times should be the same, 
which leads us to the fact that T(a) and T(b) should be the same. This is clearly not the 
case; there is a contradiction, because these are events issued by the same process. So, this 
essentially proves by contradiction that the converse is not true. If a happens before b, then 
it is correct that a scalar clock would ensure that T(a) < T(b), but the converse is not true. 

If, let us say, T(a) < T(b), it does not imply that a happened before b. If it did, concurrent 
events would have equal timestamps, and we are clearly seeing that this does not hold for this 
example. So, the converse is not correct: if, of two scalar clocks, one is less than the other, 
we cannot say that it automatically implies a cause-effect relationship, a happens-before 
relationship; this cannot be implied. 
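
The counterexample can be checked numerically. The following is a small sketch, assuming two processes that never exchange messages, so every process simply counts its own events; the helper run_without_messages and the event names a, b, c are hypothetical.

```python
def run_without_messages(events_per_process):
    """Assign Lamport timestamps when no messages are exchanged.

    With no sends or receives, each process just numbers its own
    events 1, 2, 3, ... independently of the others.
    """
    return {
        name: ts
        for events in events_per_process
        for ts, name in enumerate(events, start=1)
    }


# Process 1 issues a then b; process 2 issues c. No messages at all,
# so c is concurrent with both a and b.
T = run_without_messages([["a", "b"], ["c"]])

c_looks_earlier_than_b = T["c"] < T["b"]                      # True, yet c did not happen before b
concurrent_times_equal = T["a"] == T["c"] and T["b"] == T["c"]  # False
```

So T(c) < T(b) holds even though c and b are concurrent, and the timestamps of concurrent events are not all equal, which is the contradiction described above.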

So, what we do is that instead of a scalar clock, we have a vector clock; this solves the 
converse problem for us. Every process i has a vector clock, which is basically a vector of 
scalars; if there are n processes in the system, then we have n elements in this vector. Within 
this, the private clock of i is also embedded; it is essentially the ith element. 

So, let us refer to this clock as Vi. To ensure condition c1, whenever process i has any event, 
be it an internal event, a send or a receive, it first increments the ith element of Vi. Then, 
if it is a send, it sends the message, and in this case the message is timestamped with the 
vector clock; that is important to bear in mind. 

So, what we see here is that if there are n processes, every process maintains an n-element 
array Vi. Process i increments the ith element before sending or receiving a message or on any 
internal event, and every message is timestamped with the vector clock of the sender. Now, the 
sender sends a message to the receiver. Needless to say, before sending, the sender would 
increment its own private clock; so if the sender's id is i, it would increment the ith element 
of the ith clock. 

After process j receives the message, it would first increment the jth element of its clock, 
which means it increments its own private clock, the jth element within the array. But here is 
the key operation. 


So, the key operation here is this. Let us say the sender is sending something to the receiver; 
it attaches its vector clock along with it. Subsequently, the receiver receives it and updates 
its own internal vector clock. So finally we have the sender's clock and the receiver's clock, 
with their own internal values appropriately incremented. So, we are looking at two vectors. 


So, what we do is that we compare them element by element, and for every element pair we 
compute the maximum. This is similar to what we were doing with the scalar clock; we do exactly 
the same with the vector clock, where for every element pair we compute the maximum, and this 
maximum is used as the final value of process j's vector clock. 


So, this is shown mathematically over here: for all k, we set Vj(k) = max(Vi(k), Vj(k)), where 
Vi is the clock that process i attached to the message. So, instead of the maximum of two 
numbers, we just compute the pairwise maxima of corresponding elements across the vectors, and 
the final value of process j's vector clock is just the result of this maximum. 
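
A minimal Python sketch of these update rules follows. It assumes process ids 0 to n - 1 and follows the order described above (increment the receiver's own entry, then take the element-wise maximum); the class VectorClock and its method names are illustrative only.

```python
class VectorClock:
    """Vector clock of one process in a system of n processes."""

    def __init__(self, n, pid):
        self.pid = pid          # index of this process's private clock
        self.v = [0] * n

    def tick(self):
        # Any event (internal, send or receive) increments our own entry.
        self.v[self.pid] += 1
        return list(self.v)

    def send(self):
        # Increment first; the returned copy is attached to the message.
        return self.tick()

    def receive(self, message_v):
        # Increment our own entry, then take the element-wise maximum
        # with the vector that arrived in the message.
        self.v[self.pid] += 1
        self.v = [max(mine, theirs) for mine, theirs in zip(self.v, message_v)]
        return list(self.v)


# Hypothetical run with three processes.
p0 = VectorClock(3, 0)
p1 = VectorClock(3, 1)

p0.tick()                         # internal event at process 0
message = p0.send()               # process 0 sends to process 1
p1.tick()                         # internal event at process 1
after_receive = p1.receive(message)
```

In this run the message carries [2, 0, 0], process 1 is at [0, 1, 0] before the receive, and its clock ends at [2, 2, 0].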


So, then how do we say that one vector clock is less than the other? By the way, before we get 
to that: the increment on internal events is ensuring clock condition 1, and the maximum is 
ensuring clock condition 2. We will see why, because we have still not talked about how two 
vector clocks are compared. Clock condition 2 is simple and straightforward; it follows from 
the definition of maximum. We say that 

Vi < Vj ⇔ (∀k, Vi(k) ≤ Vj(k)) ∧ (∃k, Vi(k) < Vj(k)). 

So, it means that it is never the case that the value at any cell of Vi is more than the value 
in the corresponding cell of Vj; it is always less than or equal to. But they are not equal, 
they are not the same, which means there exists some k for which Vi(k) < Vj(k). 


So, there exists some index k for which the kth element of Vi is less than the kth element of 
Vj. For the rest, it is less than or equal to, but for at least one element it is strictly 
less than. In other words, I can say that it is less than or equal to for all elements but the 
clocks are not equal. This translates to the same thing: there is at least one element where 
they are unequal, and there process i's element is less than process j's element. 
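
Both ways of stating the comparison can be written down directly. This is a small sketch, assuming vector clocks are plain Python sequences of equal length; the function names vc_less and vc_less_alt are hypothetical.

```python
def vc_less(u, v):
    """u < v: every entry is <=, and at least one entry is strictly <."""
    return (all(x <= y for x, y in zip(u, v))
            and any(x < y for x, y in zip(u, v)))


def vc_less_alt(u, v):
    """Equivalent form: every entry is <=, and the vectors are not equal."""
    return all(x <= y for x, y in zip(u, v)) and list(u) != list(v)


# A few sample pairs; both definitions should agree on each of them.
pairs = [
    ([2, 0, 0], [2, 2, 0]),   # less than
    ([2, 2, 0], [2, 2, 0]),   # equal, so not less than
    ([3, 5], [5, 3]),         # not comparable, so not less than
]
results = [(vc_less(u, v), vc_less_alt(u, v)) for u, v in pairs]
```

On these samples the two forms agree, which is the "less than or equal to for all elements but not equal" restatement given above.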


So, this as you can see automatically guarantees clock condition c2, because the mere fact that 
we are computing a maximum enforces the less-than-or-equal-to part. And the mere fact that we 
are incrementing the receiver's clock enforces the strict part: the receiver is expected to 
have the latest version of its own counter, and given that it is being incremented, there is no 
way the sender would have the same value for the receiver's element, that is, for the case 
when k = j. 


So, there is no way that the sender would have the most recent value of this element, because 
the receiver incremented it on getting this message. So, at least for this element 
Vi(k) < Vj(k), or let us put it this way, Vi(j) < Vj(j), which is exactly what we ensured by 
the increment. So, before every send or receive we increment, and this precisely ensures this 
condition. 


So, given that our clock conditions are satisfied, this does follow the definition of a logical 
clock. Now, let us say a happens before b. It can happen before b in two ways. One is that they 
are events in the same process; then automatically, by clock condition 1, we have Va < Vb. 


If a is a send and b is a receive, we also have Va < Vb, by clock condition 2. Now the key 
question is: let us assume we have Va < Vb; does this imply that a happened before b? Well, it 
actually does, so the implication goes both ways. The clock condition is true and its converse 
is also true, and the reasons are not hard to think about, so let us prove this. 

So, what we need to prove is that if Va < Vb for two events a and b, then there is a path of 
causality between a and b. That is what we need to prove, because the other direction we have 
already proven. Of course, if there are no events in between a and b, then it is easy to prove, 
so that is the trivial case. Let us consider the harder case where there can be any number of 
events in between. 


So, let us say that event a was issued by process i and event b by process j. Let us look at 
the ith component of Va. Of course, if this were 0 it would mean that there is no event, so it 
cannot be 0. Given the fact that a is an event, it has to be associated with a non-zero 
timestamp for at least its originating process. So this is some value greater than 0; let us 
say that Va(i) = x. 


So, given that Va < Vb, the ith element of Vb is for sure greater than or equal to x, where 
process i is the originating process of event a. Now the question is, how did process j get to 
know this? Event b belongs to process j, which is a different process, so how would you ever 
get to know something about another process? 


Given the fact that you can only increment your own private timestamp within the vector clock, 
which is the timestamp corresponding to your own process id, and you cannot increment the entry 
of any other process, the only way to get information is if somebody sends it to you. This 
basically means that the only way Vb knows that the ith entry is x, or some value greater than 
or equal to x, is the following. 

There is some event in process i, where a happened first and then maybe there was some complex 
series of interactions, it does not matter, but then there was some event that set the ith 
element to some value x′, and this is the value that ends up as Vb(i). This value was then 
communicated, possibly via a host of processes, it does not matter, but finally it was 
communicated to process j that the value is x′, and clearly x′ ≥ x. 


So, given the fact that x′ was communicated to it, there has to be a causal chain of events 
within the same process and of sends and receives. That is the only way that event b, which is 
happening in process j, actually knows about the value of the clock in process i, and in 
particular its ith component. 


So, it has at least the value x that a is recording, or some value which is more than that, 
which is only possible if it was explicitly communicated to j, maybe directly or via other 
processes, it does not matter. Nevertheless, there has to be a causal chain, otherwise j would 
simply never get to know. Given the fact that j knows, a causal chain does exist, and this 
proves that if one vector clock is less than another vector clock, you can establish a causal 
chain of happens-before relationships. 


So, we have actually proven both directions: if there is a causal chain from a to b, then 
Va < Vb, and if Va < Vb, there is a causal chain between the events a and b. Since we have 
proven both directions, we have an if-and-only-if kind of implication; implication both ways 
means the two statements are equivalent. 


Now, given that it is a vector clock, let us say you have two processes; one vector clock, Va, 
could be (3,5) and the other, Vb, could be (5,3). This would basically indicate that Va is not 
less than or equal to Vb, and Va is also not greater than or equal to Vb. So, there is no 
less-than-or-equal-to relationship and there is no greater-than-or-equal-to relationship, in 
the sense that these are not comparable. 

Given the fact that these are vectors, two vector clocks are not always comparable. So, when 
the vector clocks are not comparable, we say that the events a and b are concurrent. This is 
the definition of concurrence in the world of vector clocks. 
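
As a sketch of how this classification could be coded, assuming vector clocks are tuples of equal length, the hypothetical function compare below returns one of four labels. The labels themselves are made up for this illustration.

```python
def compare(u, v):
    """Classify two vector clocks as 'before', 'after', 'equal' or 'concurrent'."""
    u_le_v = all(x <= y for x, y in zip(u, v))
    v_le_u = all(y <= x for x, y in zip(u, v))
    if u_le_v and v_le_u:
        return "equal"
    if u_le_v:
        return "before"       # u < v, so u's event causally precedes v's
    if v_le_u:
        return "after"        # v < u
    return "concurrent"       # neither <= holds: not comparable


lecture_example = compare((3, 5), (5, 3))
ordered_example = compare((2, 0, 0), (2, 2, 0))
```

For the pair (3,5) and (5,3) from the lecture, neither direction of less-than-or-equal-to holds, so the sketch reports them as concurrent.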

So, what have vector clocks actually given us? With scalar clocks, we could only say that if 
there is a causal relationship, then the scalar time of a is less than the scalar time of b. 
Vector clocks say more: you can infer causality from the values of the vector clocks 
themselves. 


So, if one vector clock is less than the other vector clock, you can infer causality, and if 
there is causality then of course Va < Vb. So, it did solve the problem of scalar clocks to a 
certain extent, but its overheads are higher: we need to store more and we need to transmit 
more. 


Now, given that we have seen scalar clocks as well as vector clocks, let us look at an 
algorithm that uses Lamport scalar clocks. We will call it totally ordered mutual exclusion, 
which is different from totally ordered multicast, even though there is a total order over 
here across the processes. 


So, we will solve this using the simpler variant of our clocks, and we will show that even in 
a world with non-deterministic network delays, which can be very large, and without clock 
synchrony, we can achieve a lot just by using our notion of time, which is scalar clocks; in 
some cases, we will also end up using vector clocks. 


But we will still make one key assumption, which is that we have FIFO channels. This means 
that if a sender is sending data to a receiver, and let us say it is sending two messages, m1 
and m2, where m1 was sent first and m2 was sent later, the messages will never get reordered 
in the network, in the sense that the receiver will receive m1 first and m2 next. 

So, basically we will think of these as FIFO (first in, first out) channels, where the network 
can delay messages, and the delay can be indefinite, but it will never reorder messages. This 
FIFO channel assumption we will always make. 
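
The lecture takes FIFO channels as given. Purely as an illustration of what the assumption buys us, here is a sketch of one common way to obtain FIFO delivery over a network that may reorder messages: the sender numbers its messages and the receiver holds back anything that arrives early. This technique is not described in the lecture, and the class FifoReceiver and its method on_arrival are hypothetical names.

```python
class FifoReceiver:
    """Delivers messages from one sender in the order they were sent."""

    def __init__(self):
        self.next_seq = 0     # sequence number we are waiting for
        self.held = {}        # early arrivals, keyed by sequence number

    def on_arrival(self, seq, payload):
        """Record an arrival and return the payloads that can now be delivered."""
        self.held[seq] = payload
        delivered = []
        while self.next_seq in self.held:
            delivered.append(self.held.pop(self.next_seq))
            self.next_seq += 1
        return delivered


# m1 was sent first (seq 0) and m2 second (seq 1), but m2 arrives first.
receiver = FifoReceiver()
first_batch = receiver.on_arrival(1, "m2")    # held back, nothing delivered yet
second_batch = receiver.on_arrival(0, "m1")   # now both go out, m1 before m2
```

The application above the receiver never sees m2 before m1, which is the property the rest of the discussion relies on.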

So, let us now look at the mutual exclusion problem. First we will consider what it means for 
two events a and b to be totally ordered. We will go back to our definition of Lamport scalar 
clocks; we will not consider the vector clock, we will consider the scalar clock. What we have 
said is that the scalar clock has limitations, but as far as we are concerned, we will try to 
derive some meaning out of it. 


So, we will say the following; of course, this is not completely in consonance with the 
definition of a scalar clock, but here is what we will say. If a is a send and b is a receive, 
then of course the clock conditions will hold, and we will further say that, as far as we are 
concerned, the converse is treated as true even though we know it not to be true. 


But let us assume, just for the sake of discussion, that we order all the events by their 
scalar clock time, and if two events have the same scalar clock time, then we break the tie by 
the ids of the processes. So, this is kind of an alternative use of a scalar clock. Technically 
speaking, we are relying on the converse, and technically speaking, it does not hold the way 
we have defined it. 


But let us assume that we have a Lamport scalar clock, and just for the purpose of ordering 
events, I repeat, just for the purpose of ordering, we say that event a (at process i) appears 
to be before event b (at process j) if the clock of a is less than the clock of b. Furthermore, 
event a appears to be before b if their clocks are the same but process id i is less than 
process id j. 
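
The ordering just described can be sketched as a comparison of (timestamp, process id) pairs. This is only an illustration; the function name appears_before is made up, and it assumes process ids are integers.

```python
def appears_before(a, b):
    """a and b are (lamport_timestamp, process_id) pairs."""
    (time_a, i), (time_b, j) = a, b
    # Smaller clock wins; on equal clocks the smaller process id wins.
    return time_a < time_b or (time_a == time_b and i < j)


# Python compares tuples element by element in exactly this way, so
# sorting the pairs directly yields the same total order.
requests = [(4, 1), (3, 2), (3, 1)]
total_order = sorted(requests)
```

Here (3, 1) comes before (3, 2) because of the tie-break on process id, and both come before (4, 1) because of the smaller clock value.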


So, this is by the way also called lexicographic ordering. This is like imposing a total order 
on events when it actually does not exist, but let us assume that we do it and we still use the 
scalar clock mechanism that we have. So, what are we going to do with it? We will solve the 
totally ordered mutual exclusion problem. For mutual exclusion, I will give you a real-world 
practical example first, and then I will talk about it. 

So, let us say that we have a system where we have a set of nodes; the nodes can be desktops 
or laptops, it does not matter, and then there is a shared printer. As far as we are concerned, 
only one of the machines in the network can access the shared printer at a time and print a 
page. So this is a resource which cannot be acquired by two machines at the same point of time; 
it cannot be used concurrently, and furthermore, once it is being used, that use has to finish, 
and only then will the next use start. 


So, we can say that different machines will have different usage requests, and they will 
somehow coordinate between themselves to ensure that the property of mutual exclusion, as I 
just defined it, holds, in the sense that only one request can use the printer at a given point 
of time; it is important that this holds. Furthermore, we will use the mechanism of Lamport 
scalar clocks to create a protocol, where the scalar clocks are incremented the way that we 
have described. 


So, the aim is to create a distributed protocol. We will not assume clock synchrony, and we 
will not assume a bounded network delay. These two things we will not assume, but of course we 
will assume FIFO channels; that is important. Between any two nodes, we will assume a FIFO 
channel. 


So, with this in mind, let us go back to our mutual exclusion problem. It says that a certain 
resource can be owned by only one process at a time, and it has to be explicitly granted and 
released. Different requests must be granted in the order in which they were made, which 
basically means the order given by the total ordering that we have imposed; grants are 
consistent with this total ordering. And if no process hangs forever after taking the resource, 
then every request is ultimately granted, in the sense that the protocol is fair. 

So, let us look at Lamport's algorithm for the mutual exclusion problem. The mutual exclusion 
problem is very generic. In any distributed system, ultimately, if let us say 10 nodes are 
going to coordinate and do something, then whenever there is a shared resource that they want 
to access, we need this. Mutual exclusion is mainly for access, in the sense that they want to 
access something that only one machine or one process can access at a single point of time. 


And it is a generic mechanism, because let us say that you want to run some other algorithm, 
or you want to centralize the algorithm and you want one machine, one process, to kind of take 
over; then also such mechanisms are required. So, we will discuss mutual exclusion, and then 
these algorithms will naturally lead us towards what is called leader election, where out of 
all the processes we can elect a leader. 


And the leader can do something on behalf of the rest. So, we are pretty much going in that 
direction, but as I said, mutual exclusion by itself is a very worthy problem to solve, because 
many a time in a distributed environment we do have shared resources, and hence the mutual 
exclusion property is required. 


So, what is our model? Our model is very simple: we have n processes and a single resource 
that needs to be requested. The n processes compete for it, and let us assume that a subset of 
them are interested. To request the resource, this is what needs to be done: process Pi sends 
a message, and the message is (Tm, i). 


So, what is Tm? Tm is essentially the Lamport scalar clock of the ith process; it does whatever 
incrementing it needs to do, and it sends Tm along with its own process id, which is i. So, 
this tuple is what is sent to the rest of the processes, which is the (n - 1) other processes. 


So, when process Pj receives this tuple, the first thing is that, as in any communication that 
involves scalar clocks, it sees the value and updates its clock using the operation that we 
have seen for c2: it takes the maximum of Tm, which is the clock of i carried in the message, 
and its own clock, and then it adds one. 

After updating its local clock, it just puts the message in its request queue. Furthermore, it 
sends a timestamped acknowledgement. The acknowledgement basically says that this is an 
acknowledgement, it is coming from process j, and this is the internal time of process j, which 
has been incremented after receiving the message. 

So, you receive the message, and as per your Lamport clock you take a maximum, increment, and 
send a timestamped acknowledgement. This is the first round of the algorithm, where a message 
is sent to (n - 1) other nodes. Now, to access the resource, what needs to be done? We will 
access the resource when the following two conditions are met. 


So, condition 1 is that this message is the earliest message in the queue, earliest by time. 
Who looks at it? The ith process looks at all of the messages in its request queue, and 
(Tm, i) must be the earliest, which means either Tm is the lowest timestamp, or, if there are 
ties for the lowest timestamp, it breaks the tie by the process id and i is the smallest. Then 
it is the earliest message in the queue; so this is condition 1. 


Condition 2 is that the process has received a message with timestamp greater than Tm from 
every other process. Typically every other process has sent a reply to i, but it could be 
another message as well; what matters is that every other process has sent it a message with a 
timestamp which is greater than Tm, the timestamp of the request. So, if both of these 
conditions hold, then we claim that it is safe to access the resource. 
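
The two conditions can be written as a single check. The sketch below is an illustration under assumptions: requests are (timestamp, process id) tuples, request_queue is a plain list that already contains the process's own request, and latest_seen maps every other process id to the largest timestamp received from it. All of these names are hypothetical.

```python
def can_access(my_request, request_queue, latest_seen):
    """Return True when both access conditions hold for my_request."""
    tm, _my_pid = my_request
    # Condition 1: our request is the earliest in the queue, with ties
    # on the timestamp broken by process id (tuple comparison does this).
    is_earliest = min(request_queue) == my_request
    # Condition 2: every other process has sent us some message whose
    # timestamp is strictly greater than Tm.
    heard_from_all = all(ts > tm for ts in latest_seen.values())
    return is_earliest and heard_from_all


# Process 1 requested at time 5; processes 2 and 3 are the other nodes.
ok = can_access((5, 1), [(5, 1), (7, 2)], {2: 8, 3: 6})
waiting_for_reply = can_access((5, 1), [(5, 1), (7, 2)], {2: 8, 3: 4})
not_earliest = can_access((5, 1), [(5, 1), (4, 3)], {2: 8, 3: 6})
```

In the second call process 3 has not yet sent anything stamped later than 5, and in the third call an earlier request from process 3 is still in the queue, so in both cases the process must keep waiting.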

Then, what is the protocol for release? After the resource has been accessed, Pi scans its own 
request queue and removes (Tm, i), the message that it had sent. Recall that when it sent this 
message to everybody on the network, the (n - 1) other nodes, it also added the message to its 
own request queue. 


So, this own message is removed. Furthermore, it sends a timestamped release message to the 
rest of the processes. The release message basically says that I am releasing my current claim, 
so this request henceforth stands released. When process Pj receives a release message from 
process i, it removes any request from process i in its request queue. 


So, this basically ensures that the original message that it had sent to, let us say, process 
j, which was (Tm, i), is removed once the release message arrives. So, the protocol as such is 
not hard; let us go back and discuss this protocol for a second. 

So, the protocol is very simple. Let us say process i wants to make a request; then it sends a 
timestamped message. It carries its own timestamp, (Ti, i), but we will use the term Tm for the 
timestamp in the message. It sends it to everybody, the rest of the nodes, so these are 
(n - 1) messages. Along with that, it adds the message to its own request queue. 


Subsequently, each of them increments its internal Lamport clock and sends a reply back. So, 
this is the second round, where again (n - 1) messages are sent, which are all the 
acknowledgements, all the acks. Subsequently, process i scans its request queue; it takes a 
look at all of the messages that are there in its request queue. 


So, it just keeps on doing that; within its queue it tracks the message that it had sent for 
getting access to the resource. It waits for the moment this becomes the earliest message, 
where earliest means it is the one with the least timestamp, and if there are multiple messages 
with the same timestamp Tm, i is the smallest id, since that is the way we break ties. 


So, this is condition 1. The second is that it should have received a message with a timestamp 
greater than Tm from every other process, from the rest of the (n - 1) nodes. Note that it is 
not greater than or equal to; the timestamp is strictly greater than Tm. 


Once both these conditions hold, we claim that it is safe to access the resource, and we also 
claim that concurrent accesses to the resource are not possible. Once the access is done, 
nothing much needs to be done: you just throw out the original message from your own queue and 
then send another message, called a release message. 


So, phase three is that i sends a release message to all the other nodes, saying: kindly remove 
the message that I had sent, which was (Tm, i), from your internal request queue. So, this is 
again (n - 1) messages. 


So, as I had said, in a distributed algorithm the complexity is measured by the number of 
messages that are sent, the reason being that internal computations are very fast and network 
messages are expensive. The total number of messages we need to send is 3(n - 1), which is 
approximately 3n for large n. So, hopefully the protocol is clear to all of you by now. The 
important thing is why it works; the proof is interesting, so that is the important thing, 
why does it work. 
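
To tie the three phases together, here is a small single-threaded simulation sketch in Python. It is an illustration under several assumptions, none of them from the lecture: process ids are 0 to n - 1, one global queue stands in for the FIFO channels (which trivially preserves per-channel order), and the names Network, Process, REQ, ACK and REL are hypothetical. It only inspects the state after all pending messages have been delivered.

```python
from collections import deque

class Network:
    """A single global FIFO queue of (src, dst, kind, timestamp) messages."""
    def __init__(self):
        self.pending = deque()
        self.sent = 0                        # total messages sent so far

    def send(self, src, dst, kind, ts):
        self.pending.append((src, dst, kind, ts))
        self.sent += 1

    def deliver_all(self, procs):
        while self.pending:
            src, dst, kind, ts = self.pending.popleft()
            procs[dst].on_message(src, kind, ts)

class Process:
    def __init__(self, pid, n, net):
        self.pid, self.net = pid, net
        self.clock = 0                       # Lamport scalar clock
        self.queue = []                      # request queue of (Tm, pid)
        self.latest = {p: 0 for p in range(n) if p != pid}
        self.my_request = None

    def _broadcast(self, kind):
        self.clock += 1                      # the send is a new event
        for peer in self.latest:
            self.net.send(self.pid, peer, kind, self.clock)

    def request(self):                       # phase 1: (n - 1) requests
        self._broadcast("REQ")
        self.my_request = (self.clock, self.pid)
        self.queue.append(self.my_request)

    def on_message(self, src, kind, ts):
        self.clock = max(self.clock, ts) + 1
        self.latest[src] = max(self.latest[src], ts)
        if kind == "REQ":                    # phase 2: timestamped ack
            self.queue.append((ts, src))
            self.net.send(self.pid, src, "ACK", self.clock)
        elif kind == "REL":                  # drop the released request
            self.queue = [r for r in self.queue if r[1] != src]

    def can_enter(self):
        return (self.my_request is not None
                and min(self.queue) == self.my_request
                and all(t > self.my_request[0] for t in self.latest.values()))

    def release(self):                       # phase 3: (n - 1) releases
        self.queue.remove(self.my_request)
        self.my_request = None
        self._broadcast("REL")

n = 3
net = Network()
procs = [Process(p, n, net) for p in range(n)]
procs[0].request()                           # both requests get timestamp 1,
procs[1].request()                           # so the tie is broken by id
net.deliver_all(procs)
first = [p.pid for p in procs if p.can_enter()]
procs[0].release()
net.deliver_all(procs)
second = [p.pid for p in procs if p.can_enter()]
procs[1].release()
net.deliver_all(procs)
messages_per_request = net.sent // 2
```

In this run only process 0 may enter first, process 1 may enter only after the release from process 0 arrives, and each request costs 3(n - 1) = 6 messages in total.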

So, we will discuss the proof shortly. I will first discuss the main idea, and then I will take you into the details. So, the main idea is that if the resource is free, let us say nobody else is accessing it, then everybody will immediately get the request, they will send an acknowledgement, and then the original process will immediately start accessing the resource. So, that is clear.


What we are claiming is that no two processes can get the resource at the same time, and furthermore, processes get the resource in the order of their requests. The order is the lexicographic order that we are imposing on the scalar timestamps. So, the argument is that if a process is getting the resource, then there are two possibilities. Either it has seen the requests of all other processes, or it has not seen the requests of some set of processes but it has seen messages from them that precede those requests. So, let us go into the details of what exactly these mean.

So, we are now trying to get into the depths of the proof, and this is the last slide. So, assume we have two processes i and j and they are accessing the resource at the same time. This of course is not possible, but let us assume to the contrary that it has indeed happened. So, we are talking of two processes i and j, and let the requests that led to this simultaneous access have timestamps Ti and Tj, respectively.


So, with no loss of generality, let us assume that Ti < Tj. So, even if the Lamport clocks are the same, we break the tie by the process id. But regardless of how we do it, the important point is that we have a way of totally ordering the tuples, which are (Lamport clock timestamp, process id). We do that, and it turns out that Ti < Tj with no loss of generality.
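
The total order on (timestamp, process id) tuples mentioned above is just a lexicographic comparison. A small sketch, assuming requests are stored as Python tuples (the helper name `precedes` is hypothetical):

```python
def precedes(a, b):
    """True if request a comes before request b.

    a and b are (lamport_timestamp, process_id) pairs. Python compares
    tuples lexicographically, so equal timestamps are broken by process id.
    """
    return a < b


requests = [(7, 2), (5, 3), (5, 1)]
ordered = sorted(requests)  # earliest request first
```

Here (5, 1) and (5, 3) carry the same Lamport timestamp, and the process id decides that (5, 1) comes first.
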

So, what would have happened? This means that j must have gotten a message from i with a timestamp greater than Tj before acquiring the resource. Otherwise, by the second condition, j could not have acquired the resource. So j must have sent its request messages, all of that would have happened, but then i must have sent it a message with a timestamp which is greater than Tj.


If that is the case, then what about the original request that i actually sent? So, what is happening? j is sending a message with a timestamp Tj, and i is sending a message back with a timestamp, let us say, Tx. We know that the relationship Tx > Tj holds, otherwise j could not have accessed the resource.


Now, what about Ti, when i sent its original message? This could not have been sent after this event, otherwise Ti would have been greater than Tx, which is greater than Tj, but that is not the case. So, this basically means that i sent its request to j before sending this message Tx. If i did that, then let us look at the fun part, so let us go over here.

So, what we are saying is that we are considering the moment at which j actually entered the critical section, as we call it in programming terminology, or accessed the resource. This essentially means the following: at some point of time j would have sent i a message with timestamp Tj, and i would have sent a message back at some point of time, let us call its timestamp Tx. For j to have accessed the resource, we need Tx > Tj.

Now, what we are arguing is that the original message which i sent to j could not have been sent after this message, because otherwise i's request would have had a timestamp greater than Tx, and Tx > Tj. That would give Ti > Tj, but we know that this is not true, because Ti < Tj holds by our assumption. Hence, it means that i would have sent its message Ti, which contains its request, to j prior to sending this message.


This further means that when j took the decision to actually acquire the resource, it would have had the message from i. So, this is an important thing, let me repeat, bear this in mind: the moment that j took the decision to acquire the resource, it is guaranteed that it would have had the request from i, and given that i and j are accessing the resource concurrently, this request would not have been released.


Now, the question arises: why did j access the resource? It was not supposed to. The reason is that in its queue, the request from itself was not the earliest request. There was already a request from i, and that was the earliest request. Consequently, j should not have accessed the resource. Hence, what we observe is that there is a contradiction.


So, our assumption that two processes have actually acquired the resource at the same point of time is not correct. It has led to a contradiction, so this is not possible. And given the fact that the wrong thing is not possible, the right thing holds all the time, which means that mutual exclusion actually holds.

So, this is what it means: mutual exclusion holds and the algorithm is correct. So, what did we do in Lamport's algorithm? We sent 3(n - 1) messages, and without assuming clock synchrony or bounded network delay, mutual exclusion is guaranteed. And the reason it is guaranteed is the way that we constructed our Lamport clocks. The key point that I am trying to make is that, given the fact that j has to wait for a message from i with a timestamp greater than its own, i would have sent its request message prior to that.


Given the FIFO channel property, that request would be there in j's request queue. You might want to hear this part of the video again. The message would be there in j's request queue, so there is no way on earth that j could have accessed the resource without i releasing it. So, concurrent access is not possible; consequently, mutual exclusion is guaranteed. So, what we will do in the next lecture is look at many more algorithms that reduce the number of messages that are required to guarantee mutual exclusion.

So, the original paper on time, clocks, and the ordering of events was a seminal paper in distributed systems, pretty much something that started the field. It was published in 1978. You can read this paper; it will give you a lot of insight into the original ideas that came to define the field.




Advanced Distributed Systems 
Professor Smruti R. Sarangi 
Department of Computer Science and Engineering 
Indian Institute of Technology Delhi 
Lecture 8 
Distributed Mutual Exclusion Algorithms 


(Refer Slide Time: 00:18) 


Mutual Exclusion 
Tokenless and Token Based Algorithms 


Smruti R. Sarangi 


Department of Computer Science
Indian Institute of Technology
New Delhi, India


So in this lecture, we will discuss mutual exclusion algorithms. So, this will be our full-fledged lecture in which we look at distributed algorithms per se. So, we will extend the basic material that we had discussed last time.


So, the key prerequisites would be logical clocks, so, Lamport clocks, both scalar Lamport 
clocks as well as vector clocks. The next important prerequisite here would be Lamport's 
algorithm for mutual exclusion, of course, assuming our asynchronous setup where we do 


not have clock synchrony. 

All right, so in this lecture we will discuss two kinds of algorithms. One kind is without tokens, so we will discuss the Ricart-Agrawala Algorithm and Maekawa's Algorithm. Then, we will discuss token-based algorithms as a part of this slide set, where we will again discuss two algorithms, the Suzuki-Kasami Algorithm and Raymond's Tree Algorithm.

So, coming to the Ricart-Agrawala Algorithm, again the key prerequisite is to understand Lamport's Algorithm, which was a part of the previous lecture. In that algorithm we said that we would require 3(N - 1) messages for achieving mutual exclusion, and we had proved that it is correct. So now I am just extending from there, and of course, the prerequisite is that that algorithm is completely understood.


So, that algorithm has three rounds, where we send a request, then we send an ack, and then we send a release. So, the question is: is a timestamped reply, or an acknowledgement, necessary? That is the key question. And so, the main idea is that instead of these 3 messages that I send to every single node, can I bring 3 down to 2? Then instead of 3(N - 1) messages, I will have to send 2(N - 1) messages, which is an improvement.

So, the idea was that in Lamport's algorithm, we send an acknowledgement immediately because we have to show that another node has recorded the message, and it sends an acknowledgement with a greater timestamp. So, the key insight over here is that we hold on to the acknowledgement and we piggyback the acknowledgement along with a release message. We will see how. So, this will help us reduce the number of messages from roughly 3N to 2N.

So, here is the algorithm. Process i sends a timestamped request message to all the other nodes. We had seen this, so this part is the same. When Process j receives a request, it sends a reply if one of two conditions holds. So this part changes.


The first is P j is neither holding the lock, nor is it interested in acquiring it. So, if it is not 
interested at that point of time in acquiring it or if it is not holding it, it just sends a reply 
and the reply would follow the same rule for scalar Lamport clock, so its timestamp would 


be higher than the request. 


The second condition is that P i's request timestamp is smaller than P j's request timestamp, and P j is not holding the lock. This basically means that P i made an earlier request, even though P j is desirous of acquiring the lock. Lock here means the shared resource; I will often use lock in place of shared resource. So kindly understand this, because in the concurrent systems world any shared resource is often simply referred to as a lock.


So coming back to our discussion, the thing is that if P j is not holding the lock and is not interested, it will send a reply. Or, if it is interested and has sent out a request, but P i's request happens to be smaller than P j's request, in the way we defined it in the previous lecture, then also it will send a timestamped reply. So this part changes, and this part is crucial. The key idea is that P j will send a reply only if one of these conditions holds.

So now, when do you acquire the lock? A process acquires the lock when it has received (N - 1) replies. So let me just go back. What would happen is that P j will not send a reply as long as neither of these conditions holds. Let me put it in a different way. It is not that P j is going to junk the request from P i. Process j is not going to do that. What P j will do is hang on to P i's request until one of these conditions is true. Subsequently, it will send a reply, and once a process P i has received the (N - 1) replies, it knows it can acquire the lock, so it will acquire the lock and finish its job. After that, releasing the lock basically means replying to all the pending requests, the requests it has kept pending and has not replied to. After it releases the lock, it will reply to all the pending requests.


So, one thing that is obvious over here is that instead of sending three messages to every node, we are sending two. The separate release is not there: the ack and the release are fused into one message. So the key idea over here is that I send a request, and other nodes simply hang on to my request and keep it pending until one of these conditions holds true. Once it holds true, they send a reply and they get rid of my request. So cleaning up the state is not an issue.


And once I get all my replies, I enter, and subsequently, if I have kept, if Process i, which 


is me, if I have kept other requests pending, I will reply to them. 
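
The reply-or-defer rule described above can be sketched as a small class. This is a minimal, single-threaded illustration under assumed names (`RANode`, `on_request`, `deferred`); it models only the decision logic, not the network.

```python
class RANode:
    """Reply/defer logic of one process in the Ricart-Agrawala scheme (sketch)."""

    def __init__(self, pid):
        self.pid = pid
        self.clock = 0
        self.my_request = None  # (timestamp, pid) while interested in the lock
        self.holding = False
        self.deferred = []      # ids of processes whose requests are kept pending

    def request(self):
        self.clock += 1
        self.my_request = (self.clock, self.pid)
        return self.my_request  # to be sent to the other N - 1 nodes

    def acquire(self):
        # To be called once all N - 1 replies have arrived.
        self.holding = True

    def on_request(self, req):
        """Return True to reply immediately, False to defer the reply."""
        ts, sender = req
        self.clock = max(self.clock, ts) + 1  # scalar Lamport clock rule
        not_interested = self.my_request is None and not self.holding
        earlier = (self.my_request is not None and not self.holding
                   and req < self.my_request)
        if not_interested or earlier:
            return True
        self.deferred.append(sender)
        return False

    def release(self):
        """Release the lock and return the processes that now get a reply."""
        self.holding = False
        self.my_request = None
        pending, self.deferred = self.deferred, []
        return pending
```

If processes 1 and 2 both request with Lamport timestamp 1, process 2 replies to process 1 at once because (1, 1) < (1, 2), whereas process 1 defers its reply to process 2 until it calls `release()`.
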

So, this is a simple algorithm. What about the proof? Hopefully, the algorithm, and understanding why it needs 2(N - 1) messages, has not been an issue. But again, I would like to say that Lamport's algorithm is a prerequisite. Kindly do not go through this unless you have understood that thoroughly.


So again, let us prove by contradiction. Assume that processes P i and P j both have acquired the lock at the same time. With no loss of generality, same approach, assume P i has the lower request timestamp. This means that P i must have gotten P j's request after it sent its own request. That is obvious.


Why? Because if, let us say, P i's timestamp is T i, then it could not have gotten P j's request before sending its own, because otherwise T i would have been the higher timestamp and the relationship T i < T j would not have held. Given the fact that this relationship is holding, it means that P i must have gotten P j's request after it sent its own.


So, this means that, according to this algorithm, the moment Process i got the request from Process j, it had already sent out its own request. Furthermore, its request timestamp was less than the timestamp of j's request. Consequently, it could not have replied via the first condition, because it was interested in acquiring the lock. And it could not have replied via the second condition either, because the request timestamp of P j was higher than its own request timestamp. Hence, it could not have sent a reply.

Given the fact that P i could not have sent a reply to P j, P j would definitely not have gotten a reply from i, because i was the one with the earlier request and the lower timestamp. So clearly, P j's condition that it should have gotten a reply from everybody would not have held. Hence, P j could not have acquired the lock. Given that P j could not have acquired the lock, we have a straightforward contradiction over here.


So, I am not explaining this proof once again because this proof, if you think about it, is 
extremely simple. But I would request the viewers to keep replaying this part of the video 
where I explain the proof until they understand it thoroughly. Let me explain it in another 
way. The key idea over here is that unlike the Lamport's algorithm where we send a reply 


immediately, here, we hang on to the reply. 


So, the argument that is being made, and the same argument was made over there also, is that there is no way that Process i would have received Process j's request before it sent its own request. Otherwise, by the law of scalar clocks, it would have had a higher timestamp. That is not the case, which means it received j's request after sending its own.


And if it receives after it, it already has its own request in its queue, which has a lower 
timestamp. So reply could not have been sent. If a reply could not have been sent, there is 
no way that Pj could have entered, could have acquired the lock, I was about to say enter 
the critical section, but again that is also not wrong because that is also another way that 
we refer to mutually exclusive objects in concurrent systems, but I would prefer not using 


that. 


So, the key idea is that P j could not have concurrently entered the critical section, or acquired the lock, which is the same thing. So, there is a contradiction here, which means that, by contradiction, the algorithm is correct. And as opposed to 3(N - 1) messages, we have an approach which needs 2(N - 1). So clearly, there is a benefit: we send one third fewer messages. And the key insight is to defer the reply. That is the key insight.


So every algorithm has to give us an insight. So the key insight over here is that we simply 


defer the reply. So this deferral is reducing 3 to 2. 
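
As a quick check of the message counts quoted above, here is a small arithmetic sketch (the function names are hypothetical). Going from 3(N - 1) to 2(N - 1) saves one message in three, which is the same as saying that Lamport's algorithm sends 50 % more messages.

```python
from fractions import Fraction


def lamport_messages(n):
    # request, reply and release, each exchanged with the other n - 1 nodes
    return 3 * (n - 1)


def ricart_agrawala_messages(n):
    # request and (possibly deferred) reply with the other n - 1 nodes
    return 2 * (n - 1)


def fraction_saved(n):
    """Fraction of Lamport's messages saved by deferring the reply (n > 1)."""
    old, new = lamport_messages(n), ricart_agrawala_messages(n)
    return Fraction(old - new, old)
```

For N = 10 this gives 27 messages versus 18, and the saved fraction is 1/3 for every N > 1.
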

Now, we will try to reduce this order N business to order √N. So, we will discuss Maekawa's algorithm.

So, the main reason for a linear number of messages is basically that we send a message to all the sites and we also expect replies from them. So the key insight is: let us not send a message to all sites, let us send a message to a subset of the sites. So, let the set of processes associated with a process P i be called its request set R i, which is a subset of all the processes.

For any two processes P i and P j, let us guarantee that Ri ∩ Rj ≠ ∅, that is, the intersection is non-null. This basically means that if Process i has a request set and Process j has a request set, then their intersection, regardless of i and j, is always non-null, which means there is at least one process which lies in the intersection of both the request sets.


So we are not considering faults here. Let us take faults out of the picture. It happens to be the case that it is possible to construct such request sets where the size of each request set is about √N. In other words, for every i, I can have a request set such that if I consider any Process j and its request set, there will be at least one process in common.


So the intersection will be non-null, and the size of these sets will be in the ballpark of √N. Of course, this will require results from field theory, so we will not go that far. We will consider a simpler argument.

So, consider N processes, and let us arrange them in a √N × √N matrix. Consider Process i: I can give it a coordinate in this 2D space, so let this be the position of i. Then I can say that all the processes in its row and in its column comprise its request set.




So, what did I do? I took all the N processes and arranged them in a √N × √N matrix. So every process has a coordinate in this matrix. And I say that for Process i, I will consider all the processes in its row and in its column to be a part of its request set. And this automatically guarantees that regardless of the value of j that you choose, for example it can be this one, the row and column of j form the request set of j, and, as you can see, the intersection is non-null.


So of course, the size of the request set over here is about 2√N, but that is okay. We will live with this; it is still order √N. So, using this construction, let us go forward. But of course, the only assumption we will need is that the intersection is non-null. And bear in mind that this is a simple way of constructing such request sets with non-null intersections. But if you want to go down to √N, you will have to use field theory, and that is also not that hard.
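
The row-and-column construction above can be sketched in a few lines. This is an illustrative sketch only: the function name `grid_request_sets` is hypothetical, processes are assumed to be numbered 0 to N - 1 in row-major order, and N is assumed to be a perfect square.

```python
import math
from itertools import combinations


def grid_request_sets(n):
    """Request set of each process = its row plus its column in a k x k grid."""
    k = math.isqrt(n)
    if k * k != n:
        raise ValueError("this sketch assumes N is a perfect square")
    sets = {}
    for p in range(n):
        row, col = divmod(p, k)
        same_row = {row * k + c for c in range(k)}
        same_col = {r * k + col for r in range(k)}
        sets[p] = same_row | same_col  # includes p itself
    return sets


sets16 = grid_request_sets(16)
# Every pair of request sets shares at least one process.
all_intersect = all(sets16[a] & sets16[b] for a, b in combinations(sets16, 2))
```

For N = 16 every request set has 2√N - 1 = 7 members, and the set for the process at (r1, c1) always meets the set for the process at (r2, c2) at position (r1, c2).
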

So here is the algorithm. This algorithm is slightly involved, but not all that involved. The first phase is the same, where Process i sends a timestamped request message to every node in R i, including itself. Previously, what was happening was that you were sending a request message to every other node in the system. This is not happening now. A message is being sent only to every node in the request set, which includes itself.


Upon receiving a request message, a node P j which is a part of R i marks itself as locked if it is not already locked, and it returns a locked reply to P i. As you can see, this part is pretty similar to the Ricart-Agrawala construction. It is not all that different. So, up till now it appears that we are proceeding along the same line, where if a process P j in the request set is not already locked, then it says look, I have no issues in being locked for Process i. So let me mark myself as locked.


So, it will mark itself as locked, and it will return a locked reply to P i, in the sense that it is committing to P i: I stand behind you, and if you want to acquire the lock, go ahead and acquire it. Essentially, I have reserved myself for you. So, this is similar to the Ricart-Agrawala construction, even though this notion of reservation was not there.


But nevertheless, the key idea is that if the node P j is unlocked, it will mark itself as locked and send a locked message back. So in the trivial case where there are no other competitors, the entire request set will send locked messages back, and then Process i will be sure that everybody in its request set was free and that there are no objections to it acquiring the lock. So, Process i can simply go forward.

Now, let us look at the next part. This part is slightly interesting. If P j is already locked by a request from another process P k, here is what we need to do. P j will then place the request from P i, which is me, in a wait queue. So it uses the same basic notion of deferral that we used in the Ricart-Agrawala construction: if P j is already committed to some other process P k, it will place P i's request in a wait queue.


If the locking request or any other request in the queue precedes the current request, then P j sends a failed message. In this case the locking request is the one from P k. So if this request or any other request in the queue precedes the current request, in the sense that it has a lower timestamp, then we send a failed message to P i.


So, in this case what happens is that P j basically tells P i that look, I have other messages in my queue which have a lower timestamp than yours, so I am sorry, I cannot lock, and I am sending a failed message. Otherwise, if that is not the case, which basically means that P i is coming with the least timestamp, then the message from P i should get a higher priority. As a result, an inquire message is sent to P k to find out what is happening.


So what is the key idea? The key idea is that from the point of view of P j, which is getting a request from P i, if it is unlocked, no problem, it will send a locked message. If, let us say, it is locked, then it will see what the priority of the request from P i is, vis-a-vis the other messages it has. If the request has a lower timestamp than all of them, then it clearly means that it is a high priority message. So P j sends an inquire message to P k, which has locked it, to ask what has happened.


If that is not the case, then it sends a failed message to P i, basically telling it that look, I was not able to lock. So these messages are clear. The idea basically is that the contention is at P j, which is already committed. If I am a high priority requester, in the sense that my timestamp is low (high priority means the timestamp is low), then what I do is make P j inquire with P k.


This is always the case. I am in an office, somebody important comes, then I just gauge the 
person's importance with respect to what I am currently doing. So let us say, I am talking 
to somebody else on the phone, and let us say somebody comes who is extremely 


important, then I place the phone down and I talk to the person. 


But vice versa, if I am talking to somebody very important and let us say somebody less 
important comes to my room, then I just continue talking and send a failed message. So 


this is very similar to that. It mimics human behavior in this part. 
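
The decision that P j takes when a request arrives can be sketched as a pure function. This is an illustration under assumed names (`handle_request`, and the message labels as plain strings); requests are assumed to be (timestamp, process id) pairs, and the caller is assumed to place the new request in the wait queue whenever the node is already locked.

```python
def handle_request(locked_by, wait_queue, new_request):
    """Decide how an arbiter node answers a newly arrived request.

    locked_by   -- request the node is currently locked for, or None
    wait_queue  -- list of requests already waiting at this node
    new_request -- the incoming (timestamp, pid) request
    Returns (message, destination_pid).
    """
    if locked_by is None:
        # Free node: commit to the requester straight away.
        return ("LOCKED", new_request[1])
    # Does the locking request, or any waiting request, precede the new one?
    earlier_exists = any(r < new_request for r in [locked_by] + wait_queue)
    if earlier_exists:
        return ("FAILED", new_request[1])
    # The new request has the lowest timestamp: ask the current holder.
    return ("INQUIRE", locked_by[1])
```

In the office analogy, `LOCKED` is a free professor, `FAILED` is "I am busy with somebody more important", and `INQUIRE` is asking the caller on the phone whether they mind hanging up.
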

So, when P k receives an inquire message, it basically means that somebody is knocking on its door to say that look, somebody more important than you has arrived. What happens then? If P k has received a failed message and knows that it cannot succeed, then it will send a relinquish message.


Fair enough. So if P k knows that it does not have a chance of acquiring the lock because it has already been spurned in that sense, it will send a relinquish message. Otherwise, it will simply defer the reply and it will not do anything. So who are the actors in this game? The actors in this game are P i, P j and P k. So think of this as, let us say, P j is me, and I am talking to P k on the phone, and then P i arrives.


So, P i sends a request message. Of course, the trivial case is that there is no P k in the picture. In this case, I just send a locked message to P i and I am done. But I am not looking at that. I am looking at the more interesting case where I am talking to P k, in the sense that I am already committed to P k, and P i arrives.


So, if it is the case that either P k has a lower timestamp than P i, or I have other requests in my request queue which have a lower timestamp than P i, I simply tell P i that look, you have no chance, you are too unimportant for me. So you, in a sense, go away. So I send a failed message.


If that is not the case and P i happens to be somebody very important, for example if I am talking to a student on the phone and the Head of the Department arrives outside my door, well then, I send the student an inquire message, and I ask the student: do you mind if I put the phone down? So then the student basically sees if there is any chance of continuing the conversation.


So, in this case P k basically sees whether it has received a failed message from others or not. If it has, it knows that it anyway does not have a chance of acquiring the resource. In terms of the analogy, it means that the student does not have a chance of getting whatever he or she wanted. Then, of course, it makes sense to send a relinquish message to say that look, you go and lock yourself for somebody else, I do not care. Otherwise, what will happen is that P k will simply defer the reply and reply later.

This sounds like a lot of fun with P i, P j and P k. Now, what else happens? When P j receives the relinquish message, it realizes that P k does not have a chance of acquiring the lock. So what needs to be done? It locks itself for the earliest message, from some process P'i, in its wait queue.


So what it does is, it takes a look at its wait queue. In the wait queue there are a bunch of requests, including the one from P i. It locks itself for the earliest message, from process P'i, which could be P i. So it locks itself and changes its status. It basically means that if the student tells me on the phone that the student is okay with me hanging up, I just hang up and talk to my Head of the Department.


Furthermore, it sends a locked message to P'i, telling it that look, I have locked myself for you. And it adds the request from P k, the one it just ditched, into its wait queue. So essentially, the idea is that if there is somebody important at your door, you ask the person whom you are talking to on the phone to kindly hang up, and of course, you are polite, in the sense that you wait for a reply from that side. So this is what is meant by P k deferring the reply.

So, this sounds simple enough. So when does a process acquire the lock? It acquires the lock when it has received locked messages from its entire request set. And why is this algorithm correct? Well, the algorithm is, of course, correct. The reason is that one process can never lock itself for two other processes at the same time.


So basically, given that two request sets will always have at least one intersecting process, 
and that intersecting process cannot commit to both, so it is not possible for two processes 
to acquire the lock and essentially break mutual exclusion. Mutual exclusion is essentially 


a long form for mutex. So it is not possible for this to happen. 

Releasing the lock. Well, once a process is done, it sends release messages to all the processes in its request set. A process in the request set then locks itself for the earliest request in its wait queue, and it sends that requester a locked message. If there is no such request, it marks its status as unlocked. So this part is easy, and a release step has been a common feature; it was there in Lamport's algorithm as well.
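
The way a node in the request set reacts to relinquish and release messages can be sketched as follows. This is a minimal illustration with hypothetical names (`Arbiter`, `on_relinquish`, `on_release`); the wait queue is assumed to be a min-heap of (timestamp, process id) requests, so the earliest request is always at the front.

```python
import heapq


class Arbiter:
    """State kept by one node of a request set (sketch only)."""

    def __init__(self):
        self.locked_by = None  # request this node is currently locked for
        self.wait_queue = []   # min-heap of waiting (timestamp, pid) requests

    def _grant_earliest(self):
        if self.wait_queue:
            self.locked_by = heapq.heappop(self.wait_queue)
            return ("LOCKED", self.locked_by[1])
        self.locked_by = None  # nobody is waiting: mark the node unlocked
        return None

    def on_relinquish(self):
        # The current holder gives way: its request goes back into the
        # wait queue and the earliest waiting request gets the lock.
        heapq.heappush(self.wait_queue, self.locked_by)
        return self._grant_earliest()

    def on_release(self):
        # The holder has finished with the shared resource.
        return self._grant_earliest()
```

For instance, if the node is locked for request (6, 2) while (4, 1) waits, a relinquish from process 2 makes the node lock itself for process 1 and queue (6, 2) again; a later release from process 1 then hands the node back to process 2.
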

Proof. So the proof for mutual exclusion is easy. Assume two processes P i and P j have the lock simultaneously, which means there has to be a node P k in both request sets that must have given locked messages to both at the same time. That is not possible. As we have seen, if, let us say, I want to give a locked message to P i, I will first have to withdraw the locked message from somebody else. So I can never be locked for two processes at once.


Given the fact that this is not possible, mutual exclusion is always guaranteed. But there 
are two more important things. What we need to look at is we need to look at deadlock and 
starvation, in the sense is it the case that Process A is waiting for Process B, Process B is 
waiting for Process C, Process C is again waiting for Process A. So do you have a circular 


wait? 


So why are deadlocks important? Because in any algorithm where we wait, the chances of a circular wait are always there. That was not there in Lamport's algorithm because we never waited. But in algorithms where we actually wait, the chances of a circular wait are definitely there. And so there is a need to look at deadlocks.

And the other is the issue of starvation, which means: is it possible that there is a request that starves, in the sense that it never gets the lock? So we will look at them. We are saying it is not possible, but we will look at them in some detail.


(Refer Slide Time: 28:49) 


Outline

Token Based Algorithms
Suzuki-Kasami Algorithm


All right, so this finishes the algorithm. It finishes the slide part. So now we will go for a slightly deeper discussion of Maekawa's algorithm using our e-notebook.

So what, again, was the idea? We had three actors, P i, P j and P k. What was P i doing? It was sending a request. If P j was unlocked, it was sending an immediate commitment that okay, I am locked for you. But that is clearly the simple case, and we are not looking at it. Let us consider the more difficult case where our good friend P j has already committed to P k.


In all cases, P j puts P i's request in its request queue. Requests are sent only once; they are never sent again. Once the request is there, P j compares the priority of this request with that of all the other requests that it has.

If it finds that this request has the lowest timestamp, it is time to tell P k: look, I have received a request with a lower timestamp from P i, so kindly tell me what you are up to. This is the inquire message. If this is not the case, in the sense that there are other requests, including the request from P k, that have a lower timestamp and hence a higher priority, then of course a failed message is sent and P i is told: look, I cannot do it immediately.


So there is clearly no waiting at P j: the failed message is sent immediately, and the inquire message is also sent immediately. But in the inquire case, as far as P i is concerned, it is not going to get a reply from P j immediately. It can get locked and failed replies immediately, but not others. We will see why.

When P j asks P k to respond, if P k is holding the lock, of course it will not respond. But let us say that P k has also sent requests to a couple of nodes in its request set. If nobody has sent it a failed message, it will simply defer the reply. And if somebody has told it, look, I cannot satisfy your request, then what comes back is a relinquish.

P k then records the fact that P j is no longer locked for it, and P j is now ready to lock itself for any other request. So it takes a look at all the requests in its request queue, which now include the requests from both P i and P k. Let us say the one with the lowest timestamp is P i's at this point. Then P j sends it a locked message and locks itself for it. Or let us say it sends a locked message to somebody else; ultimately, P i's request will become the one with the lowest timestamp at some point.


So that is the basic idea. Now for some key points. First, a request message is never sent twice. That is important. Once I send a request message to everybody in my request set, it remains in their queues. Even if they tell me failed, all that means is that, if I am P i, I can go and lock myself for somebody else. But a request message is never sent multiple times. That is point number one.

Second, a request remains in the queue until I release. Until I release it, it will continue to remain in the queue; there are no issues. As a result, guaranteeing mutual exclusion is not a big deal at all, the reason being that any node (node means process here) locks itself for only one other process at any point of time.

So it will never be the case that two processes have acquired the lock at the same time, because their request sets have at least one node in common, and that node would have committed to only one of them, not the other. That is the reason one of those processes would not have gotten a locked message from everybody in its request set. So mutual exclusion is guaranteed.
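
The intersection argument can be checked mechanically. The following is a minimal Python sketch, not the construction used in the lecture: it assumes, purely for illustration, that N = k × k nodes are laid out on a grid and that the request set of a node is its row plus its column. The helper names grid_request_sets and all_pairs_intersect are made up here. Such sets have 2√N − 1 members, somewhat larger than Maekawa's, but they exhibit the one property the proof relies on: any two request sets share at least one node.

```python
from itertools import combinations

def grid_request_sets(k):
    """Request set of node (r, c) on a k x k grid: its whole row plus its whole column."""
    sets = []
    for node in range(k * k):
        r, c = divmod(node, k)
        row = {r * k + j for j in range(k)}
        col = {i * k + c for i in range(k)}
        sets.append(row | col)
    return sets

def all_pairs_intersect(sets):
    """True if every two request sets have at least one node in common."""
    return all(a & b for a, b in combinations(sets, 2))

request_sets = grid_request_sets(4)        # 16 nodes on a 4 x 4 grid (assumed layout)
print(all_pairs_intersect(request_sets))   # every pair shares a node
print(len(request_sets[0]))                # 2 * 4 - 1 members per set
```

The shared node can be locked for only one requester at a time, so two processes can never both collect locked messages from their entire request sets.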

But what we should look at are two things. One is deadlocks. A deadlock, as I said, is a circular wait. Why can it arise? Because we are waiting over here. If P k has not gotten any failed message, we are essentially waiting. Since P k is waiting, P j is waiting, and since P j is waiting, P i is waiting. So we have a chain of waits over here involving P i, P j and P k.

The other is starvation. Starvation means that I have a request, but the request is simply never getting the lock. I might not have a deadlock, there is no cyclic wait, but I am simply never getting the lock. There is an interesting result here. Starvation freedom means that I never starve: if I am there in the system, I will ultimately get the lock. Starvation freedom implies deadlock freedom, which is very important. Deadlock freedom does not imply starvation freedom, but starvation freedom implies deadlock freedom.

So let us look at proving this. If you can prove that the protocol is starvation free, then of course things are much easier. But we will take a slightly different route. Let us understand what exactly is happening. We have P i, P j and P k. P j, as far as we are concerned, is incidental. So pretty much, P i is waiting on P k: it wants P k to relinquish so that P j can give P i the locked message.

So what is the relationship between the priorities, or let us say the timestamps, of i and k? Clearly, P i is waiting on P k, the primary reason being that P i has a higher priority, that is, a lower timestamp: T i is less than T k. If P k is in turn waiting on some other process, it does not matter which, the same relationship will hold between them. And similarly, if that process is waiting on yet another process, again this relationship will hold.

So it is clear that because of this less-than ordering, a cyclic wait is not possible. This ordering is not symmetric, in the sense that it is not possible to have some other process z at the end of the chain that is waiting on i. That is not possible because, as you can see, the less-than relationship needs to hold along the entire chain. So a cyclic wait is not possible, and given that a cyclic wait is not possible, a deadlock is not possible.
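
Written out, the argument is a short contradiction. The notation is an assumption made here for illustration: T_x is the timestamp of the pending request of process P_x, and an arrow from P_a to P_b means that P_a is waiting for P_b to relinquish. It also assumes that timestamps are totally ordered, for example by breaking ties with process IDs, as is usual with Lamport clocks.

```latex
% One waiting step forces a strict inequality on timestamps.
P_a \to P_b \;\Longrightarrow\; T_a < T_b

% A cycle of waits would chain these inequalities back to the start.
P_{i_1} \to P_{i_2} \to \cdots \to P_{i_m} \to P_{i_1}
\;\Longrightarrow\;
T_{i_1} < T_{i_2} < \cdots < T_{i_m} < T_{i_1}
```

The last chain would require T_{i_1} to be strictly smaller than itself, which is impossible, so no cycle of waits can exist.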


So now that we have proven deadlock freedom, it automatically means that the system will never be in a situation where nobody is making progress and nobody is acquiring the lock. That will not happen. Somebody will definitely acquire the lock, and that is very important.

Now let us prove starvation freedom. As we have said, starvation freedom implies deadlock freedom, not the other way round. We did not find a good way of directly proving this protocol to be starvation free, but at least we have proven it to be deadlock free. For starvation freedom, we now do not have to do much; this is actually easier. Given that we have proven deadlock freedom, we can use some of it to prove starvation freedom as well.

The key idea of starvation freedom is like this. Consider the process with the least timestamp in the system. Let it be P i, let its timestamp be T i, and let its request set be R i. The moment it has sent a request to everybody in R i, P i's request is present in the request queue of every node in R i. Any request that these nodes subsequently send has to have a timestamp greater than T i. In any case, we are assuming that among all the requests, T i is the minimum: it has the least timestamp, or in other words, the greatest priority. This is what we are assuming.

Second, we are assuming that other requests are getting served, because the protocol is deadlock free. So at any point of time, some request or the other will get served. Now the question is, when will the time for this request come? Once the request message has been sent, it has been enqueued in all the queues. And we are sure that, given the Lamport clock property and the fact that we have this max operation, the timestamp always increases. It never reduces. So we will never have a situation in the future where a request is sent with a lower timestamp. That is not going to happen.

Why? Well, because we have sent messages everywhere in the request set, and among all the requests in the entire system, this one has the smallest timestamp. So clearly, in the future, no request will be sent with a timestamp less than this. Given that, let us consider all the nodes in the request set, which includes P i itself. They will not send a locked message to any other request after getting the request from P i.
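
The claim that timestamps never go down comes from the max operation in the Lamport clock update. Here is a minimal sketch, assuming the usual rule that a receiver sets its clock to max(local, received) + 1; the class and method names are illustrative and not from the lecture.

```python
class LamportClock:
    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event or send: advance the clock and return the new timestamp."""
        self.time += 1
        return self.time

    def receive(self, msg_time):
        """Receive: jump past both the local clock and the timestamp on the message."""
        self.time = max(self.time, msg_time) + 1
        return self.time

p_i, p_j = LamportClock(), LamportClock()
t_i = p_i.tick()       # P i timestamps its request
p_j.receive(t_i)       # P j receives that request
t_j = p_j.tick()       # any request that P j sends afterwards
print(t_i, t_j)        # 1 3
```

Once a node has seen P i's request, everything it sends later carries a larger timestamp, which is exactly what the starvation argument needs.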


So that is clear. It follows from our property, mainly because this request has the lowest timestamp in the system. Subsequently, if anybody else tries to get a locked message from these nodes, all that it is going to get back is a failed message. After a certain time, of course, the messages that are already there in the system will clear up, but ultimately what will happen is that all the nodes in the request set R i will simply not participate in any other locking protocol.

They will ask the processes that they are currently locked for to kindly send relinquish messages, and they will not participate in any new locking protocol. So when they get P i's request, first, if they have locked themselves for somebody else, they will immediately send an inquire message asking it to release the lock. Subsequently, they will not entertain any newer request; they will keep on sending failed messages, and because the protocol is deadlock free, the older messages gradually drain out of the system.


So ultimately, it will be the case that the request from Process i is at the head of every queue and all of these nodes have sent a locked message back, because they are not sending a locked message to anybody else. The system has to make progress, since it is deadlock free, and no other request is able to make progress through these nodes, because all of their timestamps are greater than T i. So a time will come for P i.

Consequently, the other requests have a lower priority; they will not be able to make progress and they will not be able to lock the nodes in the request set, because those nodes will not accept any other request. The nodes of the request set will continue to send locked messages to P i, and ultimately it will get all the locked messages and will start to execute.

This means that the protocol is starvation free, because it is never possible that a request will wait for eternity to get the lock. Why? Because ultimately it will become the request with the least timestamp, and all the nodes in its request set will either inquire and ask for relinquish messages, or stop any subsequent requests by sending failed messages. So the victory is for this process.

So this is why starvation is avoided. And this completes the proof that this protocol is both deadlock free and starvation free.

(Refer Slide Time: 43:45)

Let us now compute the number of messages that are sent. For a single request, we need to send √N − 1 request messages, one to every other node in the request set. We will then receive √N − 1 locked messages. If, let us say, our luck is really bad, we might also receive that many failed messages.

Furthermore, if the priority of the requester is very high, then a maximum of, again, that many inquire messages will be sent. The reason is that at every P j in its request set, if the requester has the highest priority and P j is locked by some other process, then an inquire message will be sent, and it is possible that a maximum of that many relinquish messages will come back.

So we can say that the maximum number of messages is 5(√N − 1), which is conveniently put as O(√N). So we have done significantly better: from 3(N − 1), we came down to 2(N − 1), and now to O(√N). Lamport's algorithm needed 3(N − 1) messages, the Ricart-Agrawala algorithm needed 2(N − 1), and then we have the Maekawa algorithm, which is O(√N).
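
To get a feel for these numbers, the three counts can be tabulated for a few system sizes. This is only arithmetic on the formulas quoted above; the function names are illustrative, and N is assumed to be a perfect square so that √N is an integer.

```python
import math

def lamport(n):
    return 3 * (n - 1)

def ricart_agrawala(n):
    return 2 * (n - 1)

def maekawa_worst_case(n):
    # request, locked, failed, inquire and relinquish: 5 message types, sqrt(N) - 1 of each
    return 5 * (math.isqrt(n) - 1)

for n in (16, 100, 10000):
    print(n, lamport(n), ricart_agrawala(n), maekawa_worst_case(n))
# 16 45 30 15
# 100 297 198 45
# 10000 29997 19998 495
```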

We will now discuss two token based algorithms, the Suzuki-Kasami Algorithm and Raymond's Tree Algorithm. They differ in their philosophy from the previous algorithms.

The idea over here is that a site, that is, a process, can access the critical section if it has the token. Every process maintains a sequence number, or request ID. A request is a tuple of the form (i, m). Here i refers to process P i; it is the process ID. And m is the sequence number: it says that the process needs its mth access to the lock. So this does take some amount of past history into account, but not much.

Essentially, every request in this case is a tuple of the process ID and the access number of the lock, in the sense of whether it is your third, fourth or fifth access to the lock. Furthermore, every process P i keeps a sequence array, seq, with entries 1 to N, where the jth entry is the largest sequence number it has received from P j. Think of this as some sort of a vector clock; the idea is similar. We were using scalar clocks earlier.


So this has been inspired by the vector clock. When P i receives a message (j, m) from Process j, what it does is akin to what we do in a vector clock: it sets the jth entry to the maximum of what it had and the value that Process j is sending. So what is the idea? You have a scalar clock, which is the sequence number m in this case, and you have a vector clock, the sequence array.

This is the state that a process keeps. What is the token? It contains a queue Q of requesting sites (sites are the same as processes over here) and an array of sequence numbers, C. We will see what they do. C[i] is the sequence number of the latest request that P i executed. So, let us say, if C[i] = 3, this means that Process i has finished its third access to the lock.

Finished means finished in the past. So this is the array C. We have many data structures here, so I would request you to kindly go through this piece of the video several times to memorize what these are. For a process, we have its internal variable m, which is essentially its lock count; it is a scalar clock.

seq is a vector clock, and then for the token we have two fields: one is the queue Q of requesting processes and the other is the array of sequence numbers C, capital C, where C[i] is the sequence number of the latest request that the ith process finished.

So there are these four things: Q and C for the token, and m and seq for a process. Let us not forget these.
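
The four pieces of state can be written down as two small records. This is a sketch under assumptions that are not in the lecture: process IDs run from 0 to N − 1 instead of 1 to N, all counters start at 0, and the names Token, Process, pid and token are chosen here for illustration.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional

N = 4  # number of processes (assumed for the example)

@dataclass
class Token:
    Q: deque = field(default_factory=deque)            # queue of requesting processes
    C: list = field(default_factory=lambda: [0] * N)   # C[i]: last request of P i that finished

@dataclass
class Process:
    pid: int
    m: int = 0                                           # own request count (scalar clock)
    seq: list = field(default_factory=lambda: [0] * N)   # seq[j]: largest number seen from P j
    token: Optional[Token] = None                        # set only while this process holds the token

procs = [Process(pid=i) for i in range(N)]
procs[0].token = Token()   # exactly one token exists; P 0 starts with it
print(procs[0])
```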

To request the lock, what P i does is very similar to a vector clock. It increments its own ith entry in seq, which pretty much means that it increments m. It is saying, look, I am going to make a request, so it increments that entry. It puts the current value in a temporary variable val, and the tuple (i, val) is sent to all the sites. So it is basically saying: look, I am Process i, I just incremented my internal scalar count, it is now val, and I need access to the lock.

When P j receives (i, val), what does it do? It does the standard thing and updates its vector clock: it sets the ith entry, seq_j[i], to the max of seq_j[i] and val. Then, if the token is idle, meaning that it is with P j and is not being used, P j sends the token to P i if, and this if is important, seq_j[i] = C[i] + 1. Here seq_j[i] is the value for Process i as far as Process j is concerned, which was set by the max operation above.

This basically means the following. Say the token records that Process i has finished three requests, and now the fourth request is coming. 4 = 3 + 1, which means it is the next request, which means the token should be sent. It is not a stale message, it is not one of the old messages floating around; it is genuinely a fresh request, because this process has already accessed the lock three times and wants to access it for the fourth time. So access may be given.
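
The request step and the freshness check can be sketched as two small functions. The function names, the 0-based process IDs and the token_idle_here flag are assumptions made for this illustration; only the max update and the test seq_j[i] = C[i] + 1 come from the lecture.

```python
def make_request(seq_i, i):
    """P i bumps its own entry and returns the tuple (i, val) that it broadcasts."""
    seq_i[i] += 1
    return (i, seq_i[i])

def on_request(seq_j, C, token_idle_here, msg):
    """P j merges the request; returns True if it should hand the idle token to P i."""
    i, val = msg
    seq_j[i] = max(seq_j[i], val)
    return token_idle_here and seq_j[i] == C[i] + 1

seq_0 = [3, 0, 0]                 # P 0 has made three requests so far
seq_1 = [3, 0, 0]                 # P 1 has seen all three
C = [3, 0, 0]                     # the token records three finished accesses by P 0
msg = make_request(seq_0, 0)      # (0, 4)
print(on_request(seq_1, C, True, msg))   # True: 4 == 3 + 1, a fresh request

C[0] = 4                          # later, the fourth access has finished
print(on_request(seq_1, C, True, msg))   # False: a delayed copy of (0, 4) is now stale
```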


So the token goes to a process, which then enters the critical section, which means it grabs the lock. In this case, establishing mutual exclusion is rather trivial: if you have the token, then you enter the critical section and grab the resource; otherwise you do not. And given the fact that only one process has the token at any point of time, mutual exclusion is trivial.

So we basically have to look at starvation freedom and deadlock freedom. And I want to repeat that, as we saw in the previous lecture, starvation freedom implies deadlock freedom.

Releasing the lock is, again, very easy. P i releases the lock as follows. It sets C[i] ← seq_i[i], which means that if, let us say, it has finished its fourth access to the lock, then within the C array of the token it is going to write 4 as the ith entry. As simple as that. Furthermore, for all j, it adds P j to Q, which is the queue within the token, if seq_i[j] = C[j] + 1; that is, in i's vector clock we look at the jth entry and check whether it is equal to C[j] + 1.

This means that P i is aware of the fact that some other Process j has already acquired the lock C[j] times and is desirous of accessing it once more. It gets to know this because messages are being sent around; j may send some message to k and then k sends it to i. That is all right. So it essentially scans its sequence array, and whenever it sees this relationship hold, which means that Process j is desirous of acquiring the lock one more time, it adds P j to the queue Q.

Then it dequeues one process, say P k, from Q and sends the token to P k. As for starvation freedom: since this is a first-in, first-out queue, it is guaranteed that once a process enters the queue, it will be dequeued, the token will go to it, and it will access the resource.
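
The three release steps fit in one short function. This is a sketch, not the lecture's code: the function name and the 0-based IDs are illustrative, and the check that j is not already in Q is an added assumption to keep duplicates out of the queue.

```python
from collections import deque

def release(i, seq_i, Q, C):
    """Steps taken by P i, which holds the token (Q, C), on leaving the critical section.
    Returns the process that gets the token next, or None if nobody is waiting."""
    C[i] = seq_i[i]                                  # record the access that just finished
    for j in range(len(seq_i)):
        if seq_i[j] == C[j] + 1 and j not in Q:      # P j has exactly one fresh request pending
            Q.append(j)
    return Q.popleft() if Q else None                # FIFO order: oldest waiter first

seq_0 = [4, 1, 0, 2]     # what P 0 has heard: P 1 wants its 1st access, P 3 has asked twice
C = [3, 0, 0, 2]         # token state before release: P 3's second access already finished
Q = deque()
nxt = release(0, seq_0, Q, C)
print(C, nxt, list(Q))   # [4, 0, 0, 2] 1 []
```

In the example only P 1 satisfies the check, so it is enqueued and immediately receives the token.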

So what is the message overhead? If I am the requesting process and I already have the token, I immediately enter the critical section, so the overhead is 0. Otherwise, it is N: I send a request message to the (N − 1) other processes, and ultimately one of them sends the token back to me. So it is (N − 1) + 1 = N. As simple as that.




(Refer Slide Time: 55:01) 


Suzuki-Kasami Algorithm
Token Based Algorithms

Releasing the Lock

P i releases the lock as follows:
• C[i] ← seq_i[i]
• ∀j, adds P j to Q if seq_i[j] = C[j] + 1.
• Dequeues P k from Q, and sends the token to P k.

Message Overhead: 0 or N (that is, N − 1 + 1)




So why do I say that the requesting process gets the lock in finite time? Because the request will reach all the processes in finite time. We can argue by induction: if I assume that in an (N − 1)-node system one of the processes gets the token in finite time, then, if I do not get the token myself, one of the others will get it in finite time, and that process would have seen my request.

So my request will get added to the queue Q by precisely this line, the second step of the release procedure. This means that if I have sent a request, and the request reaches, directly or indirectly, it does not matter, any node that currently has the token, that node is bound to add me to the queue.

Sooner or later my time will come. At most N − 1 processes will be in the queue before me, each of which would have sent a request message; that is the way to interpret this line. They will finish, and my time will come. So clearly, I cannot starve. There is a built-in mechanism that prevents starvation, because the moment a process sends a request, it gets added to the queue.


Furthermore, a process also cannot swamp the network by sending multiple requests, because of precisely this little check, seq_i[j] = C[j] + 1. This check basically says the following: let us say that you have acquired the lock three times, and in the next message you send four. Then only that one request will be added. If you send one more and, let us say, it becomes five, then you lose your chance; in a sense, you get disqualified.

So you can only have one outstanding request at any point of time. This also prevents starvation to a large extent in this algorithm.

So in this algorithm, proving starvation freedom is quite trivial. That is the reason I am not discussing it any further, because it kind of extends what we did in the previous two algorithms, Ricart-Agrawala and Maekawa. Starvation does not occur primarily because any requesting process will send messages to all the rest of the processes, and the token has to be there with somebody.

It is not possible that the token is with nobody. If the token is with the requesting process itself, then there is no problem at all. Otherwise, the token is with somebody else, who will see this request and add the process to the queue. Rest assured, after a maximum of N − 1 processes finish their requests, the originating process will get the token.




(Refer Slide Time: 58:20) 


Token Based Algorithms
Raymond's Tree Algorithm

Main Idea

[Figure label: request queue]

• Nodes are arranged as a tree. Every node has a parent pointer.
• Each node has a FIFO queue of requests.
• There is one token with the root and the owner of the token can enter the critical section.
• Message Complexity: approximately O(log(N)) for trees with high fan-out


Now let us discuss one more approach, Raymond's Tree Algorithm. Here the idea is that the nodes are arranged as a tree. Every node has a parent pointer. Furthermore, each node has a FIFO queue of requests. There is one token, which is with the root, and the owner of the token can enter the critical section, which means grab the lock.

In the previous case, we said that the message complexity is either 0 or N, so the worst case is O(N). This algorithm reduces O(N) to approximately O(log(N)) for trees with a reasonably high fan-out. If we take a look at this tree, what we see is a very simple arrangement. Let us assume it is a nice, balanced tree. Every node has a parent pointer and a FIFO queue.

Let us see how this works. As the token moves across the nodes, the parent pointers change. Let us say the token moves from here to here. Just look at the previous picture: all the nodes were essentially pointing towards the root, and the root contained the token. In this case, the root has moved, because the token will always be there with the root.

So pretty much, this arrow over here got reversed. Previously it was pointing this way, but it got reversed. Now, as you can see, all the nodes still point towards the token holder. And wherever the token goes, let us say next time the token goes over here, this arrow also gets reversed. So all the nodes can still reach the node that contains the token. Of course, the arrows do get reversed as the token moves.

So they always point towards the token holder, and it is always possible to reach the token by following your parent pointers. Your parent pointers will always take you towards the token. That is the most important learning over here.
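
The pointer reversal is easy to try out. The sketch below assumes a hypothetical five-node tree stored as a dictionary of parent pointers, with None marking the token holder; the function names are illustrative.

```python
def move_token(parent, holder, child):
    """Pass the token from `holder` (the current root) to its neighbour `child`
    by reversing the single edge between them."""
    assert parent[holder] is None and parent[child] == holder
    parent[child] = None      # the child becomes the new root
    parent[holder] = child    # the old root now points at the token
    return child

def find_holder(parent, node):
    """Follow parent pointers from `node` until the token holder is reached."""
    while parent[node] is not None:
        node = parent[node]
    return node

# Hypothetical tree: node 0 is the root and holds the token.
parent = {0: None, 1: 0, 2: 0, 3: 1, 4: 1}
holder = move_token(parent, 0, 1)
holder = move_token(parent, holder, 3)
print([find_holder(parent, n) for n in parent])   # [3, 3, 3, 3, 3]
```

Only one edge flips per move, yet every node still reaches the holder by following its parent pointers.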

Requesting the token. Within its own FIFO request queue, the node adds itself if it is requesting the token. It also forwards the request to its parent. The parent adds the request of the child to its own request queue.

Furthermore, if the parent does not hold the token, and it has not already sent a request to get the token on behalf of any other child or itself, it sends a request to its parent, and so on. This process continues till we reach the token holder, which is the root. So the idea is very simple. I want to get the token; I insert a request into my FIFO queue. So far, so good.

Subsequently, what do I do? I send a request to my parent. My parent then sees whether it has already sent a request on behalf of any other child of its or not. If not, then it sends a request to its parent. Its parent again inserts the request in its own FIFO queue and sends a request to its own parent in the same way. Ultimately, we reach the token holder, which is the root.

The token holder will wait till it is done with the critical section; I mean, till it is done with the lock, it will wait. Subsequently, the token holder, which is the root node, will take a look at its queue. It will remove the entry at the head of the queue, whatever node it is, send the token to that node, and update its parent pointer to point to it.

Any subsequent node that receives the token will do the following. It will dequeue the head of its queue. If the node itself was at the head of its request queue, then it will enter the critical section; otherwise it will forward the token to the dequeued entry. Let us give an example. Let us say that this is the tree that we had, and let us say this was the original requesting node. Let us number these nodes 1, 2, 3 and 4.


So what happened is that node 1 added itself to its queue and sent a request to its parent, node 3. The parent, let us say, was not itself interested. So it also added node number 1 to its queue and sent a request to its parent, and let us say this was the token holder. The token holder added node number 3 to its queue because, as far as it is concerned, node number 3 is requesting.

Then, once the token was released, the token holder said, okay, 3 is at the head of my queue; so it reversed the direction of this arrow and made 3 the token holder. Node 3 did the same thing. It said, okay, 1 is at the head of my queue, so let me make 1 the token holder. So it reverses the direction of this pointer, and 1 becomes the token holder. And 1 sees that it is the token holder and that 1 is at the head of its own queue, so it uses the token.

If the node itself was at the head of the request queue, then it will enter the critical section; otherwise it will forward the token to the dequeued entry. After forwarding the token, a process needs to make a fresh request for the token if it still has outstanding entries in its request queue.
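
The walk-through above can be replayed in a few lines. This is a simplified, single-threaded sketch, not a full implementation: messages are modelled as direct function calls, and the tree shape is an assumption (node 4 holds the token, node 3 and node 2 are its children, node 1 is a child of node 3). The names Node, request, pass_token and the asked flag are illustrative.

```python
from collections import deque

class Node:
    def __init__(self, parent=None):
        self.parent = parent      # None means "I hold the token"
        self.queue = deque()      # FIFO queue of requesters (self or children)
        self.asked = False        # has a request already been sent to the parent?

def request(nodes, who, on_behalf_of):
    """`who` enqueues a request from itself or a child and forwards it upward at most once."""
    node = nodes[who]
    node.queue.append(on_behalf_of)
    if node.parent is not None and not node.asked:
        node.asked = True
        request(nodes, node.parent, who)

def pass_token(nodes, holder):
    """Run when `holder` has the token and is idle; returns the node that enters the critical section."""
    node = nodes[holder]
    head = node.queue.popleft()
    if head == holder:
        return holder                       # own request is at the head: use the token
    node.parent = head                      # reverse the arrow towards the new holder
    nodes[head].parent = None
    nodes[head].asked = False
    node.asked = bool(node.queue)
    if node.queue:                          # outstanding entries: make a fresh request
        nodes[head].queue.append(holder)
    return pass_token(nodes, head)

nodes = {4: Node(), 3: Node(parent=4), 2: Node(parent=4), 1: Node(parent=3)}
request(nodes, 1, 1)                        # node 1 asks for the token
winner = pass_token(nodes, 4)               # node 4 releases the token
print(winner, {k: n.parent for k, n in nodes.items()})
# 1 {4: 3, 3: 1, 2: 4, 1: None}
```

After the run, the arrows 3 → 4 and 1 → 3 have been reversed, so every node still points towards node 1, the new token holder.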

So, mutual exclusion is very simple in any token based algorithm, because no two nodes can have the token at the same time. Next, deadlock. A circular wait cannot occur because all the nodes wait on the node that holds the token. Furthermore, we are not losing messages and we will always have a rooted tree. Given the fact that we have a rooted tree, what will happen is as follows.

(Refer Slide Time: 65:06)

Let me go to our paper over here. For a deadlock, what would have to happen is that, let us say, this node is a requester and is waiting on this node, and then this node is in turn waiting on the first one. But that is clearly not possible. As you can see, these waits always go along the parent edges. A child always waits for its parent, but a parent never waits for its child.

So this is what is always happening: if I do not have the token, I ask my parent for it. If my parent does not have the token, it asks its parent, and so on and so forth. But the point is that you will never ask a child.

As a result, given the fact that the tree itself is acyclic, you will never have a cycle of parent pointers, and hence you will never have a cycle of dependencies, which means that you will not have a circular wait. No circular wait implies deadlock freedom. So this approach will not have a deadlock.


So now, what is left? Starvation. Ultimately, the request of a starved process will come to the front of all the request queues. At this point, it will have the highest priority, and the token will have to flow back to the starved process. Let us take a look at this once again.

Let us say that there is one process, this one; let us call it Process x. It will insert x in its own queue and send a request to its parent, so an entry for x will be over here and over here. And let us say that later on the parent pointers change and go this way; that is all right. You will still have the entry for x over here and over here.

Mind you, a node always adds the request to its queue, but it forwards the request only as long as there is no other outstanding request that has gone from this node to its parent. Ultimately, when x becomes the oldest request, every node on the path will find x at the head of its queue, so it will dequeue x. The token will then flow this way, this way and this way.


So the advantage of any algorithm that has a queue is basically starvation freedom where 
we say that look, you will have a queue and ultimately you will come to the beginning of 
each queue, and when you come to the beginning of a queue, it is guaranteed that you will 


get it. 


So basically, nobody can displace you from your position. And given the fact that the only 
direction in a queue that you can move is up, which means towards the head, ultimately 
you will come at the beginning of all queues, and you are bound to get the token, which in 


this case, will happen. 
So given that we have seen this, token based algorithms are very simple. The key algorithm that you should focus on is Maekawa’s Algorithm. This algorithm is kind of involved, but I think we have discussed it in great detail. The token based algorithms are quite simple.
Welcome to the lecture on leader election. In this lecture, we will be given N nodes, and these nodes need to run an algorithm in a distributed fashion to elect a leader. We are not assuming any faulty processes here. We are assuming that they will elect a leader just by exchanging messages.
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@ Leader Election in Trees 


Smruti R. Sarangi Leader Election 


So we will start with a ring overlay, where all the nodes are arranged as a ring. Every node can send a message either to its clockwise successor or to its anti-clockwise successor, that is, to its left or to its right. Those are the only messages that it can send. So this network is kind of simple and constrained.

So we will look at an algorithm with O(n^2) message complexity, and then refine that to an O(n log(n)) version. Then we will consider another version of the problem where the processes are arranged as a tree. When they are arranged as a tree, we will find that they have better properties.

So we can reduce this n log n further and make it O(n), not just in the ballpark of O(n). We will discuss this when we go to the tree part, but first we solve the simpler problem with rings.
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@ Messages can only be sent to the clockwise neighbor(left) 
or anti-clockwise neighbor(right). 
So in the ring based overlay, the idea is that we need to choose the process that has the smallest ID as the leader. Of course, there is an asymmetry here in the sense that a node that does not have the smallest ID cannot be the leader, but that is fine. In any distributed algorithm, there is some asymmetry.

If you do not have asymmetry, there is a classic theorem, which says that you actually cannot make progress. I will explain with a simple example. Let us consider a corridor, and let us consider this person to be Alice and this person to be Bob. You would have often seen that if, without any understanding beforehand, two people come face to face in a narrow corridor, with Alice moving this way and Bob moving this way, then they can collide.

Then Alice will move to her left and Bob will move to his right, so they end up on the same side of the corridor and are again face to face. Alice will move again, but Bob will also move, so they will again be face to face and will again collide. This does happen many a time with us, particularly when we come face to face with somebody in a narrow corridor.


So then typically what happens is one person says, look, I am stopping, you go. But what if the other person also says that? In this case, it is provable that we cannot make progress unless we somehow break the symmetry. For example, we say that between you and me, whoever has more letters in his name gets to move, and the other person gets to stand. And if we have the same number of letters in our names, then let us break the tie by the first letter, otherwise by the second letter, and so on.
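The corridor rule above can be written down as a tiny deterministic function. This is only a sketch: the function name gets_to_move is made up for illustration, and the choice that the alphabetically smaller letter wins a tie is an assumption, since the lecture only says that the letters are compared one by one.

```python
def gets_to_move(name_a: str, name_b: str) -> str:
    """Return the name of the person who moves first in the corridor."""
    # Rule 1: whoever has more letters in the name gets to move.
    if len(name_a) != len(name_b):
        return name_a if len(name_a) > len(name_b) else name_b
    # Rule 2: same length, so compare the first letter, then the second, and so on.
    # Assumption: the alphabetically smaller letter wins the tie.
    for ca, cb in zip(name_a.lower(), name_b.lower()):
        if ca != cb:
            return name_a if ca < cb else name_b
    # Identical names: there is no asymmetry left to exploit.
    raise ValueError("identical names: the symmetry cannot be broken")
```

Both people evaluate the same rule on the same pair of names and reach the same answer, which is exactly the role that unique node IDs play in the ring.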


So in that case, there is tie breaking. We are introducing an asymmetry to essentially solve this kind of a situation. For Hindi speakers, there is a name for this. It is called Pehle Aap. For non-Hindi speakers, this is a courtesy given to the other person, saying that you may cross first. But if the other person also returns the courtesy and says, no, no, not me, you cross first, again there will be a face to face situation.

Given that, breaking the symmetry and introducing some level of asymmetry is very important. And given the fact that we are constrained to send messages only to a clockwise neighbor or to an anti-clockwise neighbor, breaking the symmetry is even more important because we cannot send messages to others. And clearly, we do not want to break the rules by having a centralized algorithm.

So it should not be the case that we all send messages to one node, because again, which node? That node has to be the leader, but no leader has been elected yet. So we still want to have a distributed algorithm that, let us say, computes the minimum ID. And then we all agree that that node is our leader.
So we will discuss the Chang-Roberts Algorithm. The convention we will assume over here is that I am node p, and the other node is q. So always, I am p and the other node is q. Let us say one node p is an initiator, which means that I am starting the process of leader election. You need not have only one initiator. You can have many initiators.

So then it will set its state variable to find, which means that node p is in the process of finding a leader, electing a leader. What it does is that it sends its own ID to its neighbor. It does not matter whether it is the clockwise neighbor or the anti-clockwise neighbor. You choose a direction, and it sends its own ID to that neighbor.


As long as the state ≠ leader, which means that the state does not yet indicate that a leader has been elected, we continue the algorithm. So what do we do? We come to Line 3. We receive a message from q. So we have an overlay like this. I am node p, I send my ID to my neighbor, so on and so forth. And finally, from my neighbor, I receive a message. I stand slightly corrected. I do not receive a message from q; what I actually get is a message whose content is q.

So I receive a message from my neighbor on the other side where the content of the message is q. If that is the same as the ID that I had sent, then I declare the state to be leader, and I say that the operation has finished. Let me maybe draw this, or let us go to another one of our bigger diagrams.
So, what I am saying over here is, I am node p. I send my ID to my neighbor. It does something, we do not know what it does. I can also get messages from my neighbor, because my neighbor can also send me messages. So let us say my neighbor sends me a message with content q. My neighbor is not q, the content is q. Broadly speaking, we do assume that p and q are different in the general case, but let us say I get a message from my neighbor whose content is q.

Consider the case p = q. We are, of course, making one assumption here, that every node has a unique ID. Without this, our system is not possible. You can assume it is the IP address, which is, for example, unique, or there is some other way of providing a unique ID. We are not getting into that.

But once we have said that every node has a unique ID, and let us say this direction is clockwise, then if I get my own ID back from my anti-clockwise neighbor, I say that, look, the algorithm has terminated. I set the state equal to leader, which basically means that as far as I am concerned, a leader has been elected, and the leader is me.
Otherwise, let us say I get q from my neighbor, where q < p, which clearly means that I am not the leader. In the previous case, I was the leader. In this case, I am getting a message from my neighbor with a value that is lower than my ID, and I am p. So if my state is find, I set my state to lost, which means that I am not the leader.

Fair enough. That is an expression of humility. The reason is that another node already exists whose ID is lower than p. That node could be the leader, or some other node whose ID is even lower could be the leader; either is fine, but the leader is not me. So if this is the case, I just say I have lost, and that is it.

Then what I do is I propagate the value q to next_p, where next_p is my clockwise neighbor. Fine. So the checks for p = q and q < p belong to the case where p is the initiator, and this else is the else for if p is the initiator. Of course, there should be one end over here, which I am not showing on the slide, and another end for the while loop. Let us assume that these ends exist. So if p is the initiator, then you do this, but else means that p is not the initiator.
And if p is not the initiator, it basically means that p is not interested in becoming the leader. Then of course, there is no problem. What it does is that it receives q from its anti-clockwise neighbor, and it simply forwards the value of q that it gets to next_p.

And then, if it is sleeping, it just sets the state to lost, saying that look, I do not care. In any case, I am not the leader, and I am also not interested in becoming the leader. So essentially, I have no stake in the game. As far as I am concerned, my job is only to forward, and I am doing the job of forwarding.
So let us now come back and summarize. I will get rid of the ink on the slide. The idea basically is that if p is the initiator, it means that I am interested in becoming a leader. If I am interested, then what I do is I propagate my own ID down the overlay.

And let us say my own ID comes back to me, which means that nobody absorbed it. This means that at every single node, this condition held true, which means that my ID is actually the minimum. That is the reason my ID came back to me, and I became the leader. Otherwise, if my ID was not the minimum, then at some point the message would have been absorbed, that is, it would have been dropped.

In the meanwhile, if I receive a message from my neighbor, and that message has a lower ID than mine, then clearly, I am not the leader. So I set my state to lost, and I just propagate the message down the ring. And of course, the remaining condition is not handled explicitly, the reason being that I do not do anything.

If, let us say, I get a message with an ID which is greater than mine, then clearly the originator of that message is not the leader. So I do not do anything. I just drop the message and I do not propagate it.

So ultimately, if my message ends up being propagated down the ring via all the nodes, and it comes back to me, which means this condition holds, then this means that I am the leader, because nobody absorbed the message. Given the fact that nobody absorbed it, my ID must be the minimum. And of course, if I am not interested, I just propagate the message down the ring.
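The summary above can be turned into a small simulation. The following Python sketch is only an illustration under some assumptions: the function name chang_roberts, the single message queue, and the convention that a message travels from ring position i to position i + 1 are not from the slides, and messages are delivered one at a time.

```python
from collections import deque

def chang_roberts(ids, initiators=None):
    """Simulate the election on a unidirectional ring; the smallest ID wins.

    ids[i] is the unique ID of the node at ring position i.
    Returns (leader_id, number_of_messages_sent).
    """
    n = len(ids)
    initiators = set(ids) if initiators is None else set(initiators)
    state = {p: ("find" if p in initiators else "sleep") for p in ids}
    # Each in-flight message is (destination position, carried ID q).
    in_flight = deque(((i + 1) % n, p) for i, p in enumerate(ids) if p in initiators)
    messages = len(in_flight)
    leader = None
    while in_flight:
        pos, q = in_flight.popleft()
        p = ids[pos]
        if p in initiators:
            if q == p:                      # my own ID came back: nobody absorbed it
                state[p] = "leader"
                leader = p
            elif q < p:                     # a smaller ID exists: I lose and forward
                state[p] = "lost"
                in_flight.append(((pos + 1) % n, q))
                messages += 1
            # q > p: the message is absorbed (dropped)
        else:                               # not interested: just forward
            state[p] = "lost"
            in_flight.append(((pos + 1) % n, q))
            messages += 1
    return leader, messages
```

For example, chang_roberts([1, 2, 3, 4, 5]) places the IDs in ascending order along the direction of travel, and chang_roberts([3, 1, 2], initiators=[3, 2]) shows that a node that is not an initiator only forwards.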
So what is the message complexity? Well, in the worst case you can have order N initiators. The leader’s message is going to be sent N times around the ring. For the other initiators, the message will be sent a maximum of N - i times. So what is i? Let us say I am not the minimum, but I am the second minimum, and I am just beside the minimum node.

So then, if this is the minimum and I am over here, my message is going to propagate all the way till this point. So it will propagate N - 1 times. And let us say after the second minimum, the third minimum is over here, then its message will be propagated N - 2 times.

So this is kind of the worst case. If I just sum this up, this comes to O(N^2), which is a very standard result in algorithms. It basically says that the worst case arrangement is when they are arranged in ascending order.

So then the minimum’s message, of course, will go through all N hops, but the other ones will go N - 1, N - 2 and so on. And the arrangement here, of course, is that they are arranged in ascending order around the ring. So, given the fact that this is O(N^2) messages, we should make some effort to actually reduce this.
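The worst case count can be written out explicitly. Assuming that all N nodes are initiators and that the IDs ascend along the direction of travel, the message of the i-th smallest ID travels N - i + 1 hops before it is absorbed (or, for i = 1, returns to its sender):

```latex
\[
M_{\text{worst}} \;=\; \sum_{i=1}^{N} (N - i + 1) \;=\; N + (N-1) + \dots + 1 \;=\; \frac{N(N+1)}{2} \;=\; O(N^2)
\]
```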
So one optimization is the observation that globally broadcasting to everybody is not necessary. We can do something better. So let us see what we can do.


(Refer Slide Time: 15:02) 


Leader Election in Rings 
O(n log(n)) Algorithm 




So here is an algorithm with O(N) time and O(n log n) messages.

The main reason for the O(N^2) complexity is that we want each message to travel as far as possible in the ring, which is essentially in the ballpark of O(N). And since we have a sufficient number of messages that travel roughly the entire ring, we have an O(N^2) message complexity.

So instead of sending a message to everybody, can we find a way to filter the set of messages, a similar idea to Maekawa’s algorithm in mutual exclusion? Recall that the previous lecture was on mutual exclusion and we had discussed Maekawa’s algorithm. So can we do something similar?


So what we do is that we gradually consider larger windows, or larger sets of nodes, in a sequence of rounds. We will define what a round is. If we consider a ring, we define a window like this, and only one of its members can participate in the next round. And these windows will gradually increase in size.

We will see how, but essentially the idea is that we create these small communities of nodes and every community elects its own leader. Then these leaders participate, and again they form a community. So recursively, our idea is to reduce the number of messages by creating communities; then the leaders of the communities elect a leader of leaders, and so on.

So if, in every round, you can reduce the number of nodes that are participating in the algorithm by a factor of 2, that would be really great because in that case, we will have a maximum of log N rounds. So the complexity, even if each round takes O(N) messages, will be bounded by O(N log(N)), which is what we want. So let us see if we can do that. The idea is you create communities, elect a leader in each, elect a leader out of the community leaders, so on and so forth.
So let us look at this algorithm. In this case, we send one message to the left and one to the right. The previous algorithm was unidirectional. This is bidirectional. The message has four fields: probe, id, 0 and 1. We will see in a second what they are.

So, what do they basically mean? probe means it is a probe message, so probe is the message type, and id is the ID of my node. The 0 and 1 kind of establish how far the message will go. Let me explain with an example.

This will gradually become clear, but at the moment, out of these four fields, just appreciate two fields, which are probe and id. Probe is the type of the message, which means I am probing for a leader, and id is my current ID. So I send messages to my left and right. Others also do the same.

So, let us say I am a node and I receive a message from my left or right neighbor. And let us say the ID j in the message that I receive is equal to my ID. So akin to the previous algorithm, I declare myself the leader, set leader to j, and terminate.

So, it is the final leader. What I am saying is that if I receive a message which carries my own ID, it means that the message has passed through the rest of the nodes. So I will declare myself the leader and I will terminate. But let us just keep going forward.


Otherwise, the message that I get is (probe, j, k, d). Consider the case j < id, and furthermore d < 2^k. We are now in a position to define these two fields. This field over here, k, is a round number. Fine. And this field over here, d, is basically the distance from the originator.

So let us say that my current ID is id, I get j, and j happens to be lower than my ID, and d < 2^k. Then if I got the probe from the left, I send it to the right, which means I just propagate it in the same direction, and if I got it from the right, I send it to the left.

So what is the key idea? The key idea is, look, I got a message. The message has j as its identifier, along with k and d, and my ID is called id. If j < id, it means that I am not the leader. Potentially, the message’s originator is the leader. So what I need to do is help it propagate further. I simply propagate it in the same direction, in the sense that if it came from my right, I propagate it to my left, and if it came from my left, I propagate it to my right.

So I just make it go in whichever direction it was going, but with a small catch. It is still a probe message. The identifier of the message is still j. This still remains as k; k is the round number, which stays the same, we do not increment the round yet. But we say that its distance from the original node has increased by 1, so this becomes d + 1.

So what we do is we make this distance check. k basically gives you an idea of the size of the window. What we are saying is that if this is the original point, then we will have 2^k points on this side and 2^k points on that side, roughly, and furthermore, the maximum distance that you are allowed to propagate in either direction is 2^k.

The moment d becomes equal to 2^k, we bounce back. We do not propagate the message further. And for every hop that we propagate, we increment this distance and make it d + 1. It is important that you understand what exactly is happening, so I will write this once again.


There are four parts to the message. Probe is the type of the message, id is the originator, and then of course we have two more, k and d. k is the round number. What I do is I consider arcs. Let us say I consider an arc of nodes and the originator is over here, so I pretty much consider 2^k nodes clockwise and 2^k nodes counterclockwise.

So for me, this is the size of the window that I am considering. And as my message moves, its hop count increases. The last field over here is the hop count. What I am saying is that if I receive a message, I propagate it in the same direction. If it is going counterclockwise, I send it counterclockwise. If it is going clockwise, I send it clockwise, until it reaches the end of this window.

So, drawing it once again, if id is over here and the ends of the window are over here, I send one message all the way up till here and one message all the way up till here, until each reaches the end. And how do I know I have reached the end? Well, on every hop that I propagate, I increment the hop count until the hop count reaches the maximum, which is 2^k.

Now, let us say the hop count reaches the maximum, in the sense that I reach the end of the window, and at the end of the window, the same j < id holds, and d = 2^k. So what did I say? We consider 2^k nodes on this side and 2^k nodes on the other side. So let us say at the end of the window, I still find j < id, where id, of course, is a local property of that node.


In that case, as far as we are concerned, the originator has the minimum ID, at least along this path. We send a reply back. So if the message came from my left, I send a reply back to my left, and the reply path is the reverse path. The first field is now reply. You are sending a reply to the originator, which is j in this case. And k, of course, is the round number.

So, you send a reply back. And similarly, the other side also sends a reply back, if j was the minimum on that side as well. So, one thing that is clear is that if a node gets a reply from both sides, and let us say the ID of this node is j, then j is the minimum in this window. We do not know about the rest of the world, but that does not matter. At least in this window, j is the minimum. That is what we know.

So, gradually what we will do is we will increase the size of the window, which is something we have promised. But till this point, things should be clear: we just define small arcs in the circle, equal sized arcs on both sides, and simply send a message. Of course, a node can absorb the message if the originator’s ID is more than the ID of the node. But let us say it is less than the ID of the node. I am just repeating that we again have unique node IDs.

So, if it is less than the ID of the node, the node just propagates it until it reaches the end of the window. From there, the message bounces back. So you send a probe message, and what comes back is a reply. And if you receive replies from both sides, this means that if the ID is j over here, it is the minimum in this window.


(Refer Slide Time: 27:10) 
O(n log(n)) Time Algorithm - II

receive (reply, j, k) from left (right):
    if j ≠ id then
        send (reply, j, k) to right (left)
    else (j = id)
        if received (reply, j, k) from right (left) then
            send (probe, id, k+1, 1) to left and right
        end
    end
O(n log(n)) Time Algorithm

initialize:
    send (probe, id, 0, 1) to left and right

receive (probe, j, k, d) from left (right):
    if j = id then
        leader ← j
        Terminate
    end
    if j < id and d < 2^k then
        send (probe, j, k, d+1) to right (left)
    end
    if j < id and d = 2^k then
        send (reply, j, k) to left (right)
    end
So, once we receive a reply, then what happens? Let us say I receive (reply, j, k) from the left or from the right. If j ≠ my ID, then what I do is I just propagate the message. So if this is how the reply was going and this is me, I just send it along, until it reaches the node whose ID is j. And similarly, you will have another reply coming from the other side, if this ID is genuinely the minimum in this window.

So let us say that this condition does not hold, which means j = id, and we come to this condition. Say this reply was from the left. If you have also received a similar reply from the right, then you clearly know that this ID is the minimum in this entire window.

So, for this window, you are able to do it. Now what you do is you increase the window size. So again, you send a probe message. This is the message type, id is again your ID, but here is the fun part. The fun part is you increment the round number, you reset the value of the hop count to 1, and then you send the message.

So, if you go back, this is exactly what we were doing. We were looking at the distance and all of that, and what we have now done is that instead of a window for round k, we change it to a window for round k + 1, which actually doubles it on both sides.

So ultimately, what will happen is our window will encompass the entire circle. Once the window encompasses the entire circle, this condition will hold true: my own message will come back to me. When that happens, I will know that I am the leader and I will terminate. The termination criterion is similar to that of the previous Chang-Roberts algorithm, which had order N^2 messages.
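To make the probe and reply rules concrete, here is a small Python sketch that simulates the algorithm on a ring. It is only an illustration under some assumptions: the function name windowed_ring_election, the single message queue, and the direction flag (+1 or -1) attached to every message are not from the slides, messages are delivered one at a time, and the simulation stops as soon as some node sees its own probe come back.

```python
from collections import deque

def windowed_ring_election(ids):
    """Simulate the probe/reply election on a bidirectional ring.

    ids[i] is the unique ID at ring position i; the smallest ID wins.
    Returns (leader_id, number_of_messages_sent).
    """
    n = len(ids)
    queue = deque()      # entries: (destination position, message tuple)
    messages = 0
    replies = {}         # (position, round k) -> directions of the replies received

    def send(pos, direction, msg):
        nonlocal messages
        queue.append(((pos + direction) % n, msg + (direction,)))
        messages += 1

    for pos in range(n):                     # round 0: probe both neighbours
        for direction in (+1, -1):
            send(pos, direction, ("probe", ids[pos], 0, 1))

    while queue:
        pos, msg = queue.popleft()
        my_id = ids[pos]
        if msg[0] == "probe":
            _, j, k, d, direction = msg
            if j == my_id:                   # own probe went around the whole ring
                return j, messages
            if j < my_id and d < 2 ** k:     # still inside the window: keep going
                send(pos, direction, ("probe", j, k, d + 1))
            elif j < my_id and d == 2 ** k:  # end of the window: bounce back
                send(pos, -direction, ("reply", j, k))
            # j > my_id: the probe is absorbed
        else:
            _, j, k, direction = msg
            if j != my_id:                   # not for me: pass the reply along
                send(pos, direction, ("reply", j, k))
            else:
                heard = replies.setdefault((pos, k), set())
                heard.add(direction)
                if len(heard) == 2:          # minimum of my window: start round k + 1
                    for new_direction in (+1, -1):
                        send(pos, new_direction, ("probe", my_id, k + 1, 1))
    return None, messages
```

Calling windowed_ring_election([3, 7, 1, 9, 4]) returns the minimum ID, 1, as the leader together with the number of messages that were sent up to that point.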
Analysis

- The maximum number of winners after k phases is:
  - Two winners can at the least be 2^k entries apart.
  - Thus, the total number of winners after k phases is n/(2^k + 1)
- The total number of messages for each initiator in phase k is 4 × 2^k
- Total number of messages in the k-th phase is:

  $$4 \times 2^{k} \times \frac{n}{2^{k-1} + 1}$$

- Total number of messages is:

  $$M = \sum_{k=1}^{\log(n)} 4 \times 2^{k} \times \frac{n}{2^{k-1} + 1} = O(n \log(n)) \qquad (1)$$
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So then let us look at the analysis. Let us compute the message complexity. That is not that 


easy to do, but we can still do it. 


(Refer Slide Time: 29:59) 
So again, coming back, what we actually ended up doing is that we just considered larger and larger windows: one small window, then a larger window, and we kind of keep doubling it. We start with this, then we go to this, and finally the window becomes the entire circle. And when that is the case, the termination criterion holds.
So, what we can now see is the maximum number of winners after k phases. Two winners can at the least be 2^k entries apart, because within a window of 2^k you cannot have two winners. One will absorb the messages of the other. So they have to be at least this many entries apart. Thus, the total number of winners after k phases is at most n/(2^k + 1).

So, if this is the minimum distance by which they are separated, then the total number of winners will be n divided by this interval + 1. The total number of messages for each initiator in phase k is 4 × 2^k, which is simple: 2^k probe messages on this side, 2^k probe messages on the other side, and similarly we will again have replies, 2^k on this side and 2^k on that side.

So, the total number of messages in the kth phase is this much times the number of winners in the (k - 1)th phase, which comes straightforwardly from here. It is not the winners of the kth phase; we have to consider the winners of the (k - 1)th phase, because they are the ones who are going to participate in the kth phase. Phase and round are the same thing.

So, the total number of messages M is the sum of this over all phases k. And k can go to a maximum of log n, because then 2^(log n) becomes equal to n, so one of the messages will come back to its place of origin. So this is the maximum value of k.
So, if I sum this up, this is not hard at all. Pretty much what this will work out to is that the ratio of the powers of 2 is bounded by a constant. What we will be left with is this n in each of the log(n) phases, and so the sum is easily shown to be O(n log(n)). Hence we have O(n log(n)) messages. That is the message complexity of this algorithm, and the difference between order n^2 and O(n log(n)) for a large number of nodes is pretty substantial.
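The step from the per-phase count to the final bound can be spelled out as follows. This is a worked version of the sum on the slide, using only the quantities defined there (at most n/(2^{k-1} + 1) participants in phase k, each sending at most 4 × 2^k messages):

```latex
\[
4 \cdot 2^{k} \cdot \frac{n}{2^{k-1}+1} \;<\; 4 \cdot 2^{k} \cdot \frac{n}{2^{k-1}} \;=\; 8n
\]
\[
M \;=\; \sum_{k=1}^{\log n} 4 \cdot 2^{k} \cdot \frac{n}{2^{k-1}+1}
\;<\; \sum_{k=1}^{\log n} 8n \;=\; 8n \log n \;=\; O(n \log n)
\]
```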


And this is clearly one of the ways in which we can elect a leader, by gradually increasing the size of the window by a factor of 2. The other thing we did is we actively suppressed messages, in the sense that in every window, there is only one winner. That winner is the one who is going to participate in the larger window. When we double the size of the window, this winner is going to participate.

The rest of the nodes in the erstwhile window will not be initiators. This is the same as saying that unless you win the quarterfinal, you cannot participate in the semifinal, and unless you win the semifinal, you cannot participate in the final. So, we are suppressing messages. We are telling a lot of nodes not to send just because they lost a round.

So, after suppressing messages, what will happen is that ultimately, very few winners will be left. And finally, only one winner will be left.
Leader Election in Trees

- Let us consider arbitrary networks.
- Creating a ring based overlay is difficult (it amounts to constructing a Hamiltonian cycle - NP Hard).
- However, creating a tree based overlay is easy.
- To further optimize the process, we can choose the MST (minimum spanning tree) as the overlay.
- Assumptions:
  - Let the current node be termed p
  - Let a neighbor be termed q
  - All the leaves (degree=1) are initiators

So, now let us consider arbitrary networks. Creating a ring based overlay is, in general,
difficult. The reason is that if there is a large network with an arbitrary topology, how do
we lay a ring through all the nodes? This amounts to constructing what is called a Hamiltonian
cycle, which is NP-hard. So this is quite difficult to create.

Instead of passing a ring through an arbitrary set of nodes, having a spanning tree, or better
still a minimum spanning tree, is the best idea. We do not yet know how to create a minimum
spanning tree in a distributed setting; this is what we will study in the next lecture. But if
we could create one, which we all know is easy to do sequentially, that would be really
helpful.

So here again, we will say the current node is p and a neighbor is termed q. Within the
spanning tree that we somehow create (how, we will see in the next lecture), let all the
leaves, the nodes with degree = 1, be the initiators. Clearly, once the spanning tree has been
created, every node knows whether it is a leaf or not, so it also knows whether it is an
initiator or not.
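
As a small illustration of the initiator rule, the sketch below picks out the leaves of a tree stored as an adjacency list. The tree, the node IDs and the function name `find_initiators` are hypothetical and chosen only for this example; the lecture does not prescribe any particular representation.

```python
from typing import Dict, List


def find_initiators(tree: Dict[int, List[int]]) -> List[int]:
    """Return the leaves (degree == 1) of a tree given as an adjacency list."""
    return sorted(node for node, neigh in tree.items() if len(neigh) == 1)


# Hypothetical 6-node tree; the IDs are arbitrary.
TREE = {
    5: [3],
    3: [5, 8, 1],
    8: [3],
    1: [3, 7, 4],
    7: [1],
    4: [1],
}

INITIATORS = find_initiators(TREE)
```

Each node can evaluate this test locally, since it only needs to count its own neighbors in the overlay.
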

Leader Election in Trees


Initialization


/* Wake up all the nodes */
if p is an initiator then
    awake ← true
    foreach q ∈ neigh(p) do
        send wakeup to q
    end
end
while numWakeups < |neigh(p)| do
    receive(wakeup)
    numWakeups ← numWakeups + 1
    if awake = false then
        awake ← true
        foreach q ∈ neigh(p) do
            send wakeup to q
        end
    end
end

So, given the fact that every leaf is an initiator, this is what we do. If node p is an
initiator, we set its internal awake variable to true, which means that node p is now awake.
Then, for each of its neighbors (a leaf has only a single neighbor), it sends a wakeup
message to q.

As I said, I am always p and the other node is always q. So I am p over here, and I send a
wakeup message to q. What does the receiving node, say the parent of a leaf, do? As long as
the number of wakeups it has received is less than its number of neighbors, it receives a
wakeup message and increments numWakeups.

This means that if I have an internal node that is getting wakeup messages from the
initiators, then as long as it has not received a wakeup message from every one of its
neighbors, it will continue to receive and continue to increment the numWakeups variable.
Furthermore, if awake is equal to false, it sets awake to true.

Then, for each q ∈ neigh(p), that is, for each of its neighbors, this node again sends a
wakeup message, telling its neighbors: look, I am waking you up. Some of those neighbors are
initiators that are already awake, but it could have another neighbor, let us say its parent,
and it will send a wakeup message there also.

So what do we do? If you are an initiator, you wake yourself up and wake up your neighbors.
If you are an internal node, then as long as you have received fewer wakeups than the number
of your neighbors, you keep receiving them, and the first time you receive one you wake up and
wake up your neighbors. Fine. Fair enough.
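
To make the wakeup stage concrete, here is a minimal single-process simulation of it. It is only a sketch under assumptions: the network is modelled as one FIFO queue of (sender, receiver) pairs, the tree and the names `wakeup_phase`, `channel` and `sent` are hypothetical, and a real system would run this logic independently at every node.

```python
from collections import deque
from typing import Dict, List, Tuple


def wakeup_phase(tree: Dict[int, List[int]]) -> Tuple[Dict[int, bool], int]:
    """Simulate the wakeup stage; return the awake flags and the message count."""
    awake = {p: False for p in tree}
    num_wakeups = {p: 0 for p in tree}
    channel: deque = deque()  # pending wakeup messages as (sender, receiver)
    sent = 0

    def wake(p: int) -> None:
        nonlocal sent
        awake[p] = True
        for q in tree[p]:  # send wakeup to every neighbor, the parent included
            channel.append((p, q))
            sent += 1

    for p in tree:  # the leaves are the initiators
        if len(tree[p]) == 1:
            wake(p)

    while channel:
        _, p = channel.popleft()  # p receives a wakeup
        num_wakeups[p] += 1
        if not awake[p]:  # only the first wakeup triggers forwarding
            wake(p)

    # Every node ends up with exactly one wakeup per neighbor.
    assert all(num_wakeups[p] == len(tree[p]) for p in tree)
    return awake, sent


# Hypothetical 6-node tree.
TREE = {5: [3], 3: [5, 8, 1], 8: [3], 1: [3, 7, 4], 7: [1], 4: [1]}
AWAKE, SENT = wakeup_phase(TREE)
```

In this run every node wakes up exactly once and sends one wakeup per neighbor, so each edge carries two wakeup messages, one in each direction.
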

After the wakeup stage, we actually run the algorithm, and it is quite simple. We have two
variables over here, received and min_p. Initially min_p is my own ID, which is p, and
received is 0. Each internal node loops as long as received is less than its number of
children.

Everybody knows how many children it has. Node p continues to receive a message <r> from q,
where q is the sender and I am p. For each message, I set the q-th entry of my rec array to
true and increment the number of messages I have received. Furthermore, the minimum that I
maintain becomes the minimum of the previous minimum and the value r that I got from my child.

So, essentially, after all the nodes have woken up, every child sends a value to its parent (a
leaf simply sends its own ID). The parent computes the minimum of everything it has gotten
from its children and its own ID. Once it has done that, it sends this overall minimum, the
minimum of all of its children and itself, to its parent.

In this way, we aggregate values as we walk up the tree: all the children send their values
to the parent, the parent computes the minimum and sends it to its own parent, and so on. The
value keeps propagating towards the root. This is what you would expect to happen in a rooted
tree, where all the traffic ultimately flows towards the root.
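
The aggregation towards the root can be sketched as a small simulation. This is an illustrative model, not the lecture's code: the tree is given as hypothetical parent pointers (with the root pointing to itself), messages travel through a single FIFO queue, and the names `convergecast_min` and `PARENT` are assumptions made for this example.

```python
from collections import deque
from typing import Dict, List, Tuple


def convergecast_min(parent: Dict[int, int]) -> Tuple[int, int]:
    """Propagate subtree minima to the root; return (minimum ID, messages sent).

    Assumes a tree with at least two nodes whose root is its own parent.
    """
    children: Dict[int, List[int]] = {p: [] for p in parent}
    for p, par in parent.items():
        if par != p:
            children[par].append(p)

    received = {p: 0 for p in parent}
    rec: Dict[int, Dict[int, bool]] = {p: {} for p in parent}
    min_p = {p: p for p in parent}  # each node starts with its own ID
    channel: deque = deque()        # messages as (sender q, receiver p, value r)
    messages = 0

    def send_to_parent(p: int) -> None:
        nonlocal messages
        channel.append((p, parent[p], min_p[p]))
        messages += 1

    for p in parent:  # leaves have nothing to wait for
        if not children[p]:
            send_to_parent(p)

    while channel:
        q, p, r = channel.popleft()
        rec[p][q] = True
        received[p] += 1
        min_p[p] = min(min_p[p], r)
        if received[p] == len(children[p]):
            if parent[p] == p:  # the root holds the global minimum
                return min_p[p], messages
            send_to_parent(p)
    raise ValueError("expected a tree with >= 2 nodes and a self-parented root")


# Hypothetical rooted tree: node 3 is the root and points to itself.
PARENT = {3: 3, 5: 3, 8: 3, 1: 3, 7: 1, 4: 1}
LEADER_ID, UP_MESSAGES = convergecast_min(PARENT)
```

Exactly one value travels up each edge, so a tree with N nodes needs N - 1 messages for this stage.
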

Leader Election in Trees


Send Proposal to Parent


/* Collate result from the leaves and send to parent */
received ← 0
min_p ← p
while received < #children do
    receive <r> from q
    rec_p[q] ← true
    received ← received + 1
    min_p ← min(min_p, r)
end
send <min_p> to parent

So now each child sends the minimum that it has computed over its own children and itself to
its parent. The parent computes the minimum for its entire sub-tree, then forwards it to its
parent, which again does the same. Finally, the message reaches the root.

We assume that the root has a special property: the root is its own parent, and the root is
its own child also. So the root is self-referential. Why do we make this assumption?

Well, this will be clear when we look at other algorithms that also work with trees, but let
us assume for the time being that this is the way our tree has been created: every node
points to its parent and the root points to itself. That tells the root that it is the root.
So what did we do in the last slide? We sent the computed minimum to the parent.

So what would the root do? The root would essentially compute the minimum of the entire tree,
which is the ID of the leader, wherever it is, and it would forward that to itself. So as far
as the root is concerned, it will begin execution at Line 1, where it receives r from its
parent, which is the root itself.

The moment it receives a value from its parent, as far as the root is concerned, the process
of leader election is over, and the time has come to broadcast the result. It computes the
minimum of its own sub-tree, which in this case is the entire tree, and the value it is
receiving.

For the root this line is redundant, but the important point is that res contains the ID of
the system-wide minimum. This will be broadcast down the tree to every single node. One of
these nodes will actually be the leader; it will have the least ID. So every node compares
res with p, and if res matches p, then that node is indeed the leader and sets its state to
leader. Otherwise, it sets its state to lost. Fair enough.

Then every node p sends the message that it got from its parent regarding the global minimum
to each of its children. It does not send it to its parent: the recipients are q ∈ neigh(p)
with q ≠ parent, which makes q a child of p. The children again propagate the message
downwards.

Via this method the leader can elect itself as the leader. Furthermore, the result is
broadcast all over the tree, because every parent forwards it to its children, each child
forwards it to its own children, and so on.
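
The broadcast stage can be sketched in the same simulated style. The code below is an assumption-laden illustration: `tree`, `parent`, `res` and the function name `broadcast_result` are hypothetical, the root is identified by pointing to itself, and the queue stands in for messages sent from a parent to its children.

```python
from collections import deque
from typing import Dict, List, Tuple


def broadcast_result(
    tree: Dict[int, List[int]], parent: Dict[int, int], res: int
) -> Tuple[Dict[int, str], int]:
    """Push the global minimum `res` down the tree; each node decides its state."""
    root = next(p for p in tree if parent[p] == p)
    state: Dict[int, str] = {}
    messages = 0
    channel: deque = deque([root])  # nodes that have just learned res

    while channel:
        p = channel.popleft()
        state[p] = "leader" if res == p else "lost"
        for q in tree[p]:
            if q != parent[p]:  # forward to children only, never to the parent
                channel.append(q)
                messages += 1
    return state, messages


# Hypothetical tree (adjacency lists) and matching parent pointers.
TREE = {5: [3], 3: [5, 8, 1], 8: [3], 1: [3, 7, 4], 7: [1], 4: [1]}
PARENT = {3: 3, 5: 3, 8: 3, 1: 3, 7: 1, 4: 1}
STATE, DOWN_MESSAGES = broadcast_result(TREE, PARENT, res=1)
```

Only the node whose ID equals the broadcast value marks itself as the leader; everyone else records that it lost, and one message travels down each edge.
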

Analysis


Message Complexity


@ On every edge, we can send at the most two wakeup messages.

@ We can send a proposal and its reply.

@ A tree with N nodes has (N - 1) edges.


Message Complexity: 4N - 4 = O(N)

So what is the message complexity? On every edge we can send at most two wakeup messages, one
in each direction. We can also send a proposal going up to the parent and a reply coming back
down. An N-node tree has N - 1 edges, so the maximum number of messages is 4 times this: two
wakeup messages, a leader proposal, and a reply per edge. So it is 4N - 4, which is O(N).
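
Written out per stage, the count is as follows. The grouping into three stages is only a restatement of the argument above.

```latex
\[
\underbrace{2(N-1)}_{\text{wakeup, both directions}}
\;+\; \underbrace{(N-1)}_{\text{proposal to parent}}
\;+\; \underbrace{(N-1)}_{\text{reply to children}}
\;=\; 4(N-1) \;=\; 4N - 4 \;=\; O(N)
\]
```
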

This is the last slide. These, of course, are the references; we will come to them in a
second. What we have discussed is that from O(N^2), which was the naive algorithm, we went to
O(N log(N)). Of course, we did use a little bit of a trick there, the idea of the expanding
window. These two approaches were for rings.

From there, we went to O(N). Instead of creating a ring, which we also argued is hard to
create, we created a tree. We found that a tree is a very good overlay, in the sense that it
is very easy to send a message, very easy to compute the minimum, and very easy to compute a
bunch of statistics.

In that sense, a tree is quite a superior data structure as far as a distributed system is
concerned. Of course, at this moment we do not know how to take a set of nodes and construct
a tree, specifically a minimum spanning tree, which would minimize the total sum of the
weights of all the edges. You can think of an edge weight as the message sending cost.

So if you somehow have a way of computing an MST, then you can very easily solve problems such
as leader election. As you can see, the complexity goes down, subject to the fact that we
know how to create the tree. So that is the subject of the next lecture.

These are a few references, some very popular books on distributed algorithms. Everything
that we are doing at the moment is an asynchronous algorithm, in the sense that there is no
time base. When we discuss other kinds of algorithms, you will find synchronous algorithms.
There, the algorithm proceeds in rounds, where it is assumed that, let us say, every node
sends exactly k messages in each round.

Everybody knows when a round begins and when it ends, which means they have some sort of a
shared time base. We will discuss such synchronous algorithms later. At the moment, we want
to create an MST, a minimum spanning tree, given a set of nodes. That is our next problem,
because we have clearly seen how such a tree will be useful.
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Advanced Distributed Systems 
Professor Smruti R Sarangi 
Department of Computer Science and Engineering 
Indian Institute of Technology Delhi 
Lecture No 10 
Distributed Minimum Spanning Tree and Distributed Snapshots 


Welcome to the lecture on Distributed Minimum Spanning Trees and Distributed Snapshots. What
we saw in the earlier lecture on leader election is that if we are able to overlay a tree on
top of a network of distributed nodes, instead of the ring that we have traditionally been
used to, then many distributed algorithms become simpler. For example, electing a leader or
finding the smallest element becomes much easier.

In this lecture we will also discuss one way of taking a snapshot of a network, which also
becomes easier if we have a tree. So, which tree should we choose? A good tree is a minimum
spanning tree, the reason being that it minimizes the total length of the edges, so it kind of
keeps things close by.

So, we will discuss the Gallager-Humblet-Spira algorithm, the GHS algorithm: first an
overview, then the algorithm and its analysis, and then distributed snapshots.

Properties of an MST


If each edge of the graph has a unique weight, then the MST is unique.


Construction based on Least Weight Edge


@ A fragment is a sub tree of an MST.

@ An outgoing edge of a fragment has one endpoint in the fragment, and one node outside the fragment.

@ Proposition: If F is a fragment and e is the least weight outgoing edge, then F ∪ e is also a fragment.


So, let us first look at some basic properties of an MST. Before studying the distributed MST
algorithm, it is important that Kruskal's algorithm and Prim’s algorithm for sequential MST
construction are understood quite thoroughly, including their proofs. The proof of Prim’s
algorithm in particular is very important.

You should go through the inductive proof of Prim’s algorithm. I have already suggested a few
references, but you can first look it up in any popular text on algorithms. The proofs are
important.

So now, without going into depth, I will list a few properties of an MST that we shall use.
The first is the property of uniqueness, which says that if each edge of the graph has a
unique weight, then the MST is unique. This is easy to prove. So you will not have two MSTs.
Of course, if you have edges with the same weight, then you could have a non-unique MST, in
the sense of two MSTs with the same total weight; otherwise it will not happen.

Furthermore, here is one more theorem that is a direct outcome of the proof of Prim’s
algorithm, which we call construction based on the least weight edge. Let us consider a
fragment to be a sub tree of an MST.

So, take a minimum spanning tree, take a sub tree of it, and call this a fragment. Then an
outgoing edge of a fragment has one endpoint in the fragment and one endpoint outside the
fragment. As you can see over here, this is an outgoing edge: it has one node within the
fragment and one node outside the fragment.

So we basically have a sub tree, we have the rest of the graph over here, and there are edges
that connect this fragment with the rest of the graph. We are now looking at a property of
such edges. For anybody who knows the proof of Prim’s algorithm this theorem will be rather
obvious: if F is a fragment and e is the least weight outgoing edge, then F ∪ e is also a
fragment.

What does this mean? Consider F to be the fragment and everything else to be the rest of the
world, and draw a line between them. There might be multiple edges that go from a node of the
fragment to the rest of the graph. Let them be e, e’, e’’.

What the theorem says is that if e is the least weight outgoing edge, which means that out of
all the edges that connect this fragment to the rest of the graph e is the one with the least
weight, then F ∪ e is also a fragment of the MST. In other words, e is an edge that is part of
the MST. This is easy to prove. So, let us look at this once again.


So, let us say that this is a sub tree, a fragment F, and e is its least weight outgoing edge
to the rest of the nodes. We claim that e is part of the MST. Let us assume this is not the
case. Then the MST must connect F to the rest of the nodes through some other outgoing edge
e’, and e is not a part of it. What you would see is that if I now add e to this tree, it will
clearly create a cycle, and this cycle contains both e and e’.

In this cycle, we know that W(e’) > W(e), because e is the least weight outgoing edge and the
weights are unique. Let us assume that the rest of the tree remains the same and its weight is
fixed. Let the weight of the fragment be W(F) and the weight of the rest of the tree be W(R).

So, let us consider two trees, one that has e and one that has e’. The weight of the tree that
has e is W(F) + W(e) + W(R). The weight of the tree that has e’ is W(F) + W(e’) + W(R),
because in that tree edge e is not there.

Clearly the W(F) and W(R) parts are common. We assumed that the tree with e’ is the MST, but
for that to happen we would need W(e’) < W(e), which we know is not correct because
W(e’) > W(e). Hence, out of all the outgoing edges, the least weight outgoing edge, which is
edge e in this case, has to be a part of the MST.
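
The comparison of the two trees can be written compactly. Here T denotes the tree that uses e and T’ the tree that uses e’; these two names are introduced only for this summary.

```latex
\begin{align*}
W(T)  &= W(F) + W(e)  + W(R) \\
W(T') &= W(F) + W(e') + W(R) \\
W(T') - W(T) &= W(e') - W(e) > 0
  \qquad \text{since } e \text{ is the least weight outgoing edge and weights are unique}
\end{align*}
```

So T’ is strictly heavier than T and cannot be the minimum spanning tree, which is the contradiction used above.
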


And F ∪ e will thus become a fragment. This is exactly the intuition that is used in Prim’s
algorithm to iteratively increase the size of the tree. We first consider the starting node
as a single-node fragment, then we look at all of its neighbors and take the least weight
edge. This gives a new fragment. Then we draw another boundary around it and again pick the
least weight outgoing edge.

We gradually keep on expanding the boundary and adding edges, and our criterion always is
that, for the boundary around the fragment that we have created, we pick the least weight
outgoing edge. Given that we have proved that F ∪ e is a fragment, we are always sure that
the edge we are adding is a part of the MST.

If we continue to grow the tree, ultimately we will encompass all the nodes. Any N-node tree
has N - 1 edges, so when we have N - 1 edges we know we are done. The proof is by induction:
at every step we start from a fragment of the MST, and adding the new edge still maintains
this property. So when we reach the end, which is when we cover all N vertices with N - 1
edges, the tree is still an MST.

Of course, if you are not able to understand what I said, then I would suggest that you do not
go forward, because you will not be able to understand the rest. First take a look at the
proof of Prim’s algorithm; that is the most important. Then try to understand this theorem,
and proceed only if it is understood.
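
For readers who want to revisit the sequential algorithm first, here is a compact sketch of Prim's algorithm that grows a single fragment by repeatedly taking its least weight outgoing edge. The graph, its weights and the names `prim_mst` and `GRAPH` are hypothetical; the sketch assumes a connected graph with unique edge weights.

```python
import heapq
from typing import Dict, Hashable, List, Tuple

Graph = Dict[Hashable, Dict[Hashable, float]]


def prim_mst(graph: Graph, start: Hashable) -> List[Tuple[Hashable, Hashable, float]]:
    """Grow one fragment from `start`, always adding its least weight outgoing edge."""
    in_fragment = {start}
    mst_edges: List[Tuple[Hashable, Hashable, float]] = []
    frontier = [(w, start, v) for v, w in graph[start].items()]
    heapq.heapify(frontier)

    while frontier and len(mst_edges) < len(graph) - 1:
        w, u, v = heapq.heappop(frontier)
        if v in in_fragment:
            continue  # both endpoints are inside now, so it is no longer outgoing
        in_fragment.add(v)
        mst_edges.append((u, v, w))
        for x, wx in graph[v].items():
            if x not in in_fragment:
                heapq.heappush(frontier, (wx, v, x))
    return mst_edges


# Hypothetical connected graph with unique edge weights.
GRAPH = {
    "a": {"b": 4, "c": 1},
    "b": {"a": 4, "c": 2, "d": 5},
    "c": {"a": 1, "b": 2, "d": 8},
    "d": {"b": 5, "c": 8},
}
MST_EDGES = prim_mst(GRAPH, "a")
TOTAL_WEIGHT = sum(w for _, _, w in MST_EDGES)
```

The loop stops once N - 1 edges have been collected, which matches the termination condition described above.
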

GHS Overview


@ Initially each node is a fragment.

@ Gradually nodes fuse together to make larger fragments.

@ A fragment joins another fragment by identifying its least weight outgoing edge.

@ The nodes in a fragment run a distributed algorithm to cooperatively locate the least weight outgoing edge.

@ Gradually the number of fragments decreases.

@ Ultimately there is one fragment, which is the MST.

So, the overview of GHS is that we want to take Prim’s algorithm and create a distributed
version of it. Initially every single node is a fragment. Gradually, nodes fuse together to
make larger and larger fragments, something that we also saw in ring based leader election,
where the windows kept growing larger.

In this case, a fragment joins another fragment via the previous theorem, which is by
identifying its least weight outgoing edge. How do we find the least weight outgoing edge?
All the nodes within a fragment run a distributed algorithm to find it.

Gradually the number of fragments decreases, and ultimately only one fragment remains, which
covers the entire set of nodes. That is the MST. We make two assumptions here. One is that we
have unique edge weights, which gives us a unique MST. The other is that the graph is
connected. These assumptions are made by other algorithms as well, so there is nothing special
over here.
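
The idea of fragments fusing over their least weight outgoing edges can be tried out in a centralized form. The sketch below is not the GHS message protocol: it is a sequential, Borůvka-style simulation with no levels and no messages, and the names `fuse_fragments`, `NODES` and `EDGES` are hypothetical. It assumes a connected graph with unique edge weights, as stated above.

```python
from typing import Dict, Hashable, List, Tuple

Edge = Tuple[float, Hashable, Hashable]  # (weight, u, v)


def fuse_fragments(nodes: List[Hashable], edges: List[Edge]) -> Tuple[List[Edge], int]:
    """Repeatedly fuse fragments over least weight outgoing edges.

    Returns the chosen edges (sorted by weight) and the number of fusion rounds.
    """
    name: Dict[Hashable, Hashable] = {v: v for v in nodes}  # initially one fragment per node
    mst = set()
    rounds = 0

    while len(set(name.values())) > 1:
        best: Dict[Hashable, Edge] = {}  # fragment name -> least weight outgoing edge
        for w, u, v in edges:
            if name[u] != name[v]:       # the edge leaves both fragments it touches
                for f in (name[u], name[v]):
                    if f not in best or w < best[f][0]:
                        best[f] = (w, u, v)
        if not best:
            raise ValueError("graph is not connected")

        for w, u, v in set(best.values()):
            fu, fv = name[u], name[v]
            if fu != fv:
                mst.add((w, u, v))
                for x in nodes:          # every node of one fragment takes the other's name
                    if name[x] == fv:
                        name[x] = fu
        rounds += 1
    return sorted(mst), rounds


# Hypothetical connected graph with unique weights.
NODES = ["a", "b", "c", "d"]
EDGES = [(4, "a", "b"), (1, "a", "c"), (2, "b", "c"), (5, "b", "d"), (8, "c", "d")]
FUSED_EDGES, ROUNDS = fuse_fragments(NODES, EDGES)
```

With unique weights the edges selected in one round never form a cycle, which is why each of them can be added safely.
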


(Refer Slide Time: 12:21) 

Properties of a Fragment


@ Each fragment has a unique name.

@ When two fragments combine, then all the nodes in one fragment change their name to a new name.

@ Each fragment has a level.

@ Assume that F1 is combining with F2. It can only do so if level(F1) ≤ level(F2).

@ If level(F1) < level(F2) then all the nodes in F1 take on the name and level of F2.

@ If level(F1) = level(F2) then the level of both of the fragments gets incremented by 1.

@ The nodes of F1 ∪ F2 get assigned a higher level (old level++).

So, what are the properties of a fragment? We are now getting into our distributed algorithm,
not completely, but we are looking at it from the outside. Let us give each fragment a unique
name, a unique ID. When two fragments combine, all the nodes in one fragment will change their
name to a new name. If two fragments combine to create a bigger fragment, then of course a
name has to be assigned to the result, the same way as when two companies merge.

Typically, if a large company gobbles up a small company then the large company's name does
not change, but if two equal-sized companies merge then the new name reflects both. We will
see something similar happening with fragments. Assume that fragment F1 is combining with
fragment F2. It can only do so if level(F1) ≤ level(F2).

What this means is that if fragment F1 is joining fragment F2, it can only do that if F1 is
smaller or equal. So a bigger fragment cannot gobble up a smaller one, but a smaller one can
always approach a bigger one, or one of the same size, asking to join. The level is somehow
indicative of a fragment's size; we will see how.

If level(F1) < level(F2), then all the nodes in F1 take on the name and level of F2. This is
like a smaller company joining a big company such as a multinational: all the nodes in the
smaller company F1 take on both the name and the level of F2. So every fragment has a name and
a level, and if level(F1) is less than level(F2), then F1 loses its identity.

If, however, and this is an important point, level(F1) = level(F2), then the level of both
fragments gets incremented by 1. So if two fragments with an equal level merge, the level of
the combined fragment increases by 1. That is how the level increases.

Furthermore, what happens with the name is also interesting: both get a new name, which
combines the names of both. We will see in a couple of slides how that happens, but the case
where the levels are equal is more complicated than the case where they are not. The nodes of
F1 ∪ F2 get assigned a higher level, which is the old level plus one.


(Refer Slide Time: 15:36) 

Rules for Combining Fragments


So, what we are saying is that we have a small fragment and a big one, and we are combining
them over the combining edge. If F1 < F2 in terms of levels, then of course the name and level
of F2 get transferred to F1. But if they have the same level, then the level of both is
incremented and a new name is given to both. So, it is not that it becomes one big homogenous
fragment.

Combining Rules


Let (F1, L1) be desirous of combining with (F2, L2). e_F1 is the least weight outgoing edge of F1 and it terminates in F2.


RULE LT

If L1 < L2 then we combine the fragments. All the nodes in the new fragment have name F2 and level L2.


RULE EQ

If L1 = L2 and e_F1 = e_F2, the two fragments combine, with all the nodes in the new fragment having:

@ The level L1 + 1.

@ The name e_F1.


RULE WAIT

Wait till any of the above rules apply.

So, we will now define two important combining rules and one waiting rule. Let (F1, L1), which
is fragment F1 with level L1, be desirous of combining with (F2, L2), where e_F1 is the least
weight outgoing edge of F1 and it terminates in F2. In consonance with what we have said,
there are two combining rules. One is the less-than rule, the LT rule.

If L1 < L2 then we combine the fragments, and all the nodes in the new fragment will have name
F2 and level L2. It is like a smaller guy merging with a bigger one, which we discussed in
the previous slide as well. But if the levels are equal, that is where we said there is a
catch.

If the levels are equal, then we check whether the least weight outgoing edges of the two
fragments are the same or not. We combine two fragments with the same level only if their
least weight outgoing edges are the same; if they are not, we do not combine. So this is
important: even if their levels are equal, they do not just combine like that. Only under the
LT rule, when L1 < L2, does F1 combine with F2 unconditionally.

Otherwise, their least weight outgoing edges have to be the same; only then do we combine.
As we have discussed, the final level is L1 + 1, and we give both fragments, and the nodes
within them, a new common name, which is the name of the edge. Let us assume that every edge
has a unique weight and a unique name also; the name could just be a combination of the two
node IDs of the edge.

It does not matter how the name is derived; the edge will be the common name for the nodes of
both fragments, F1 ∪ F2. If neither of the above rules applies, we just wait for one of them
to apply. So what this is telling us is that small fragments will always go and merge with
bigger ones.

But equal-sized fragments have a key condition, which is that their least weight outgoing
edges need to be the same. Only then will they actually merge, increment their level, and take
as the new name the edge that connects them, the least weight outgoing edge.
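
The two combining rules and the waiting rule can be captured in a small decision function. This is a sketch of the rules only, not of the message exchange that implements them in GHS. The `Fragment` class, the field name `lwoe` (least weight outgoing edge) and the convention of naming an edge by its pair of node IDs are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Hashable, Optional, Tuple

Edge = Tuple[int, int]  # hypothetical edge name: the two node IDs, smaller first


@dataclass(frozen=True)
class Fragment:
    name: Hashable
    level: int
    lwoe: Edge  # least weight outgoing edge of this fragment


def combine(f1: Fragment, f2: Fragment) -> Optional[Tuple[Hashable, int]]:
    """Decide what happens when f1 wants to join f2 over f1.lwoe.

    Returns the (name, level) of the combined fragment, or None if f1 must wait.
    """
    if f1.level < f2.level:
        # LT rule: the smaller fragment is absorbed and keeps nothing of its own.
        return f2.name, f2.level
    if f1.level == f2.level and f1.lwoe == f2.lwoe:
        # Equal levels and a common least weight outgoing edge:
        # the edge becomes the new name and the level goes up by one.
        return f1.lwoe, f1.level + 1
    return None  # WAIT rule: neither rule applies yet


ABSORBED = combine(Fragment("A", 0, (1, 2)), Fragment("B", 2, (3, 4)))
MERGED = combine(Fragment("A", 1, (1, 2)), Fragment("B", 1, (1, 2)))
WAITING = combine(Fragment("A", 1, (1, 2)), Fragment("B", 1, (2, 5)))
```

Note that a fragment with a higher level than its target also falls into the waiting case, which matches the statement that a bigger fragment never joins a smaller one.
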

Variables


state                              sleep, find, found
    sleep                          The node is not initialized.
    find                           The node is currently helping its fragment search for e_F.
    found                          e_F has been found.
status[q]                          basic, branch, reject
    basic                          Edge is unused.
    branch                         Edge is a part of the MST.
    reject                         Edge is not a part of the MST.
name                               Name of the fragment.
level                              Level of the fragment.
parent                             Points towards the combining edge.
bestWt, bestNode, rec, testNode    Temporary variables.
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So, now we will discuss an algorithm it is a fairly long algorithm. So, we will have to discuss 
the state that we maintain. So, we have three states sleep, find and found. Sleep means the 
nodes has not been initiated. Find means the node is currently helping its fragment search for 
€r, ef 1s the least weight outgoing edge. Found means e; has been found or the least weight 


outgoing edge has been found. So, that is what this means. 


So, here also I am P and the other node is q. So, every node maintains an array called status q. 
So, status q is basically the status of the edge from p to q. So, status q where I am P and other 
nodes are q. So, for every q which is my neighbor I will have a status array with a qth entry. 
So, then it will have three values basic, branch and reject. Basic means the edge is unused, 


branch means edge is part of MST. 


Reject means the edge is definitely not a part of MST basic, branch and reject. Basic means as 
of now we do not know its status, branch means we know its status and we know it is a part of 
a MST. Reject means we know it is status and we know for sure that is not a part of MST then 
we have discussed name and level, name of the fragment, level of the fragment, parent so the 
parent basically says the following that let us say two nodes for the same level combine or let 


us say even a smaller node combines with a bigger node. 


So, there will always be a combining edge. Let us first consider a combination with the LT rule. Let us say that this is a small fragment and this is a much bigger fragment. In this case, every node over here will point to some other node, which points to some other node, which will ultimately take it towards the combining edge. So, every node will point towards the combining edge.

So, these are essentially parent pointers which take us towards what is called the combining edge. Similarly, if we have a combination of two fragments where the level was the same, every node here will have parent pointers towards the common combining edge, and every node there will also have a parent pointer that goes towards the common combining edge.

We will see why this is the case, but this is how the parent pointers work. And then of course we have a bunch of temporary variables like best weight, best node, test node and so on, which are purely temporary variables.
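As a rough sketch, the per-node variables listed above could be held in a small Python structure. The names NodeVars, State, EdgeStatus and the weights dictionary are illustrative assumptions and are not defined in the lecture; only the variable meanings come from the slide.

```python
from dataclasses import dataclass, field
from enum import Enum
from math import inf
from typing import Dict, Hashable, Optional


class State(Enum):
    SLEEP = "sleep"   # node not yet initialized
    FIND = "find"     # helping the fragment search for its least weight outgoing edge
    FOUND = "found"   # the least weight outgoing edge has been found


class EdgeStatus(Enum):
    BASIC = "basic"    # status not yet known
    BRANCH = "branch"  # known to be part of the MST
    REJECT = "reject"  # known not to be part of the MST


@dataclass
class NodeVars:
    """Variables kept by node p (hypothetical container, one instance per node)."""
    weights: Dict[str, float]                      # assumed input: w(pq) per neighbor q
    state: State = State.SLEEP
    status: Dict[str, EdgeStatus] = field(default_factory=dict)
    name: Optional[Hashable] = None                # name of the fragment
    level: int = 0                                 # level of the fragment
    parent: Optional[str] = None                   # neighbor leading to the combining edge
    best_wt: float = inf                           # temporary variables
    best_node: Optional[str] = None
    test_node: Optional[str] = None
    rec: int = 0

    def __post_init__(self) -> None:
        # Every edge starts out as basic (unexplored).
        for q in self.weights:
            self.status.setdefault(q, EdgeStatus.BASIC)


p = NodeVars(weights={"q1": 4.0, "q2": 1.5})
```

Here status is indexed by the neighbor's identifier, mirroring the status[q] notation of the slide.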


(Refer Slide Time: 22:41) 
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Initialization 


Current node p. Neighbor q.

Algorithm 1: Initialization

pq is the least weight edge from p
status[q] ← branch
level ← 0
state ← found
rec ← 0
send <connect,0> to q
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So, now let us discuss the algorithms. So, the algorithms we have many algorithms 8 or 9. So, 


always the assumption is that the current node is p and the neighbor is q. 


So, let us look at the initialization. Initialization is when things are starting. Let us say I am node p, and from node p, pq is the least weight edge. So, clearly, if this is the initial node, I can draw a circle around it; pq intersects this circle and this, needless to say, is the least weight outgoing edge. So, I set the status of this edge to branch, that is, status[q] ← branch. I start from level 0, and my state is found because I have found a least weight edge.

rec ← 0: we will see what rec means, but at the moment, for this case, it is 0. And I send a connect message to q; p will send the connect message to q. This is when I am initializing, so I know that the level of q will at least be my level. As far as I am concerned, pq is my least weight edge, so let me send a connect message and let us see what q does.

If q responds, I connect; otherwise I do not. What I send is a <connect,0> message. We will see in a second what the 0 stands for, but essentially, given that pq is my least weight edge, I request q to kindly connect with me, and I do that by sending a connect message. You can clearly see that in this case pq will be a branch of the MST; there is no reason why it should not be.
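The initialization step can be sketched in a few lines of Python. This is only an illustration: the function name initialize, the dictionary layout of the node, and the outbox list that stands in for actually sending messages are all assumptions.

```python
def initialize(weights):
    """Sketch of Algorithm 1 for node p.

    `weights` maps each neighbor q to w(pq). Returns the node's variables
    and an outbox of (destination, message) pairs instead of sending.
    """
    q = min(weights, key=weights.get)           # pq is the least weight edge from p
    node = {
        "status": {r: "basic" for r in weights},
        "level": 0,
        "state": "found",
        "rec": 0,
    }
    node["status"][q] = "branch"                # status[q] <- branch
    outbox = [(q, ("connect", 0))]              # send <connect,0> to q
    return node, outbox


node, outbox = initialize({"a": 7, "b": 2, "c": 5})
```

In this example the edge to b is the lightest, so it becomes a branch and b receives the connect message carrying level 0.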
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Process connect Message 


Algorithm 2: Processing of the connect message

Receive <connect,L> from q:
    if L < level then
        /* Combine with rule LT */
        status[q] ← branch
        send <initiate,level,name,state> to q
    end
    else if status[q] = basic then
        wait
    end
    else
        /* Combine with rule EQ */
        send <initiate,level+1,pq,find> to q
    end
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Combining Rules 


Let (F1, L1) be desirous of combining with (F2, L2). e_F1 is the least weight outgoing edge of F1 and it terminates in F2.

RULE LT

If L1 < L2, then we combine the fragments. All the nodes in the new fragment have name F2 and level L2.

RULE EQ

If L1 = L2 and e_F1 = e_F2, the two fragments combine, with all the nodes in the new fragment having:

• The level is L1 + 1
• The name is e_F1

RULE WAIT

Wait till any of the above rules apply.
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Now, coming to algorithm 2 which is the processing of the connect message. So, when I receive 
a connect message, the message type is connect and L is the level of the sender. In the initialization case, the level of the sender is 0. Now, when a node receives a connect message, of course you can say p sent it to q, but what is our convention? Our convention is that I am always node p and the other node is q.

So, as far as I am concerned, I am node p and I am getting a connect message from another node, which is q. This convention might be confusing, but this is what we adopt. So, now I have gotten a connect message, so I will look at my level and the level of the sender. If L < level, it means that the level that is coming along with the message is less than my level.




No problem; this means that a smaller fragment wants to combine with a larger fragment. So, I can happily combine with rule LT, absolutely no issues. I will set status[q] to branch, and then I, the larger fragment, will send an initiate message to the smaller fragment saying: look, I have accepted your connect message, and as of now you are initiated.

This is your new level, this is your new name, this is your new state, and the state is whatever my state is. If I am currently searching for my least weight outgoing edge, then now that your fragment has joined me, you need to help. So, whatever my state is, that is now your state; you take it and you initialize yourself. This is sent to q. Just a quick disclaimer at this point.


This p and q business can be confusing, because you will argue that in the previous slide p sent a connect message to q, and now you are saying that I am p and I received a connect message from q. How is this possible? Well, in a distributed algorithm you are looking at distributed state machines, where every node is an independent computing entity. A node gets a message, and based on that it updates its state table.

And then it sends messages to other nodes, including the one that sent it the message. This is how distributed algorithms are written. It might sound tricky and confusing, but whoever is doing the action, sending or receiving a message, is always node p, and the entire world outside is always node q. This of course is complicated, but if you can appreciate this complexity, then appreciating such algorithms will become much, much easier.


So, let us come back to our discussion; I will clean the slide. The idea here was that I am combining with rule LT primarily because I have a case where a small fragment is requesting a bigger fragment to connect, and the bigger fragment has no issues at all. So, it sets status[q] to branch and sends an initiate message back to the smaller fragment saying: okay, look, this is my name, level and state.

Henceforth, this is your name, level and state as well. Else, if this condition does not hold and the status of the edge is basic, which means that as far as I am concerned this edge has not been explored (status[q] = basic), then as per the slide I wait. In the remaining case, I combine with rule EQ.

So, why will rule EQ be applicable over here? We are only checking the status of the edge. Does reaching this branch automatically mean that L = level, and do the conditions for rule EQ actually hold? Well, you will see that they actually hold and there is no error over here, but it is important to remember this; we might come back to this point. So, what we do is that we combine with rule EQ, and for this we send the initiate message.

Why EQ applies here is, at the moment, a leap of faith, but we will resolve that. When we send an initiate message, we are saying that the two fragments are at the same level. So, no problem: your new level is level + 1, and your new name is the edge by virtue of which this message is coming, the joining edge, which is pq, with p on one side and q on the other side. This is how we are joining.


And so, the name is the edge pq, and furthermore, given that we have joined, our new state should be equal to find. Find, basically because now we have a bigger fragment, so we need to expand it further, which means that we need to find our least weight outgoing edge and grow. So, your state is find and my state also has to become find. This is basically what we do: we send an initiate message to the new fragment, and the new fragment starts the process of finding.

So, we did make certain assumptions here that we have not proven, but let us continue. The application of rule LT and rule EQ should be very clear over here. If you want, you can just go back to the slide where we defined the EQ and LT rules, and as you can see it was not all that complicated. LT was the smaller fragment joining the larger fragment, and EQ was both the fragments being at the same level and having the same least weight outgoing edge. The question that we have kept open is why rule EQ will be applicable over here, which we will gradually see.
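The three branches of the connect handler can be sketched as follows. This is a minimal illustration under stated assumptions: the node is a plain dictionary, messages to be sent are returned as (destination, message) pairs, the "wait" case is modelled by appending to a hypothetical deferred list, and the edge name pq is represented as a sorted tuple of the two node identifiers.

```python
def on_connect(node, q, L):
    """Sketch of Algorithm 2: node p handles <connect,L> from neighbor q."""
    if L < node["level"]:
        # Rule LT: the lower-level fragment is absorbed into ours.
        node["status"][q] = "branch"
        return [(q, ("initiate", node["level"], node["name"], node["state"]))]
    if node["status"][q] == "basic":
        # Wait: keep the message until one of the rules applies.
        node["deferred"].append((q, ("connect", L)))
        return []
    # Rule EQ: the merged fragment gets level + 1 and is named after edge pq.
    edge_name = tuple(sorted((node["id"], q)))   # assumed encoding of "pq"
    return [(q, ("initiate", node["level"] + 1, edge_name, "find"))]


p = {"id": "p", "level": 1, "name": "F", "state": "find",
     "status": {"q": "basic", "r": "branch", "s": "basic"}, "deferred": []}
lt = on_connect(p, "q", 0)      # lower level sender: rule LT
eq = on_connect(p, "r", 1)      # same level over a branch edge: rule EQ
held = on_connect(p, "s", 1)    # same level over a basic edge: wait
```

A deferred message would be re-examined later, once the node's own level or the edge status has changed; that re-delivery loop is left out of this sketch.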
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Receipt of initiate message 


Algorithm 3: Processing of the initiate message

Receive <initiate,level',name',state'> from q:
    (level, name, state) ← (level', name', state')
    parent ← q
    bestNode ← ∅
    bestWt ← ∞
    testNode ← none
    foreach r ∈ neigh(p): (status[r] = branch) ∧ (r ≠ q) do
        send <initiate,level',name',state'> to r
    end
    if state = find then
        rec ← 0
        findMin()
    end


Now, how do you process the initiate message? Let us say that node p, it is always node p, gets an initiate message from node q. The message type is initiate, and we get level', name' and state'. So, no problem, we set our level, name and state to level', name' and state'. My level is now level', my name is name' and my state is state'.

Furthermore, given the fact that this is the combining edge, I set my parent equal to q. So, if this is pq, I set parent ← q. Then what I do is I propagate the update. I reset a few local variables, bestNode, bestWt and testNode; we will see what these are. Then, for each of my neighbors r ∈ neigh(p), as long as status[r] = branch, I forward the message.

So, essentially I propagate this along the small MST that I have created. Along the small little MST within my fragment I forward this message, which means that the status has to be branch, meaning the edge is part of this MST fragment. Furthermore, r ≠ q, which means I do not send the message back to my parent; I only send it to my children. And I am sending the initiate message, which says: look, your fragment has now combined with some other fragment.

So, now we are under the rule of a new fragment, or let us say we all have changed our name and level. I have already done that, now you do it. That is the reason we send an initiate message with the new level, name and state to, let us say, a child r. The child r also does the same thing, and so on and so forth. What you can clearly see is that all of them ultimately end up pointing, indirectly of course, towards the combining edge. Indirectly means via their parents.

Now, what we do is we look at the state. If the state is equal to find, then it means that I am supposed to do my job as a good member of the fragment by finding the least weight outgoing edge. So, I set the rec variable to 0, again an internal state variable, and I call the function findMin so that all of us, the members of the new fragment, can find the least weight outgoing edge.
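A small sketch of the initiate handler may help to see the propagation along branch edges. The dictionary-based node, the returned list of (destination, message) pairs, and passing findMin in as the find_min callable are assumptions made to keep the example self-contained.

```python
from math import inf


def on_initiate(node, q, level, name, state, find_min):
    """Sketch of Algorithm 3: node p handles <initiate,level',name',state'> from q."""
    node["level"], node["name"], node["state"] = level, name, state
    node["parent"] = q                      # q lies on the path to the combining edge
    node["best_node"] = None
    node["best_wt"] = inf
    node["test_node"] = None
    # Forward along the fragment's tree edges, but never back to the parent.
    out = [(r, ("initiate", level, name, state))
           for r, s in node["status"].items()
           if s == "branch" and r != q]
    if state == "find":
        node["rec"] = 0
        out += find_min(node)               # start looking for an outgoing edge
    return out


p = {"status": {"q": "branch", "a": "branch", "b": "basic", "c": "reject"}}
sent = on_initiate(p, "q", 2, "F2", "find", find_min=lambda n: [])
```

Only the child a receives the forwarded initiate message here: q is the parent, and b and c are not branch edges.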
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findMin 


Algorithm 4: findMin

findMin:
    if ∃q ∈ neigh(p): (status[q] = basic) ∧ (w(pq) is minimal) then
        testNode ← q
        send <test,level,name> to testNode
    end
    else
        testNode ← ∅
        report()
    end
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Now, problem find min, find min is not hard at all. So, find min by the way is not a message 
as you can see; it is an internal function, and it is this internal function that is being called over here. This is what findMin says. Again, I am p and the other node is q. I look at all the neighbors of p and check whether there is some q ∈ neigh(p) such that status[q] = basic, which means that as far as p is concerned pq is an unexplored edge.

Out of all of the outgoing edges that are basic, w(pq) must be minimal. These are all of p's candidate outgoing edges. Clearly, if the status of an edge is branch or reject, it cannot be a candidate outgoing edge; it can only be a candidate if its status is basic, which means it is unexplored. So, out of all the unexplored edges, I find the node q such that the weight of pq is minimal within this set.

Then I set testNode ← q, and I try to check if it is possible to grow the fragment via this pq edge. So, what has happened is that two fragments have merged. There has been a common merging edge, and then initiate messages have been sent. So, now every node in the new fragment is aware that its boss has changed.

If, let us say, the boss was in the find state, then all the nodes in the joint fragment will also be in the find state; boss here means the larger fragment. If both the fragments are at the same level, then, as we will see in the later algorithms (algorithms 6, 7, 8 and 9), both of them will send initiate messages to each other, they will increment the level in both the fragments, and they will also shift to the find state, which means that they will start to find the least weight outgoing edge.


So, every node will try to do its part. What every node will do is scan all of its basic edges, that is, neighbors whose status is basic, find the minimum one, and try to see if a connection can be initiated with it. So, it will do a test: it will send a test message indicating its level and its name. The name of course is the name of the fragment that it belongs to, not its own name. The message goes to testNode, which in this case is q.

If such a q is not found, then it will set testNode ← ∅ and report this back: look, I did not find any. So, there are two possible outcomes of findMin. One is that you send a test message; the other is that you report that you did not find any. Let us see what happens in both cases. So, how did we reach here? The big picture is that after fragments merge, the process of merging cannot stop.

A fragment cannot stop growing. After two fragments have merged, it is the job of every single node in the merged fragment to look for further expansion; that is where we enter the find state. In the find state every node needs to do its part, which means find all of its neighbors with a basic status, find the minimal weight neighbor in this set, and see if a connection with it can be initiated by sending a test message. If it does not find any, of course, it should report.
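The two outcomes of findMin can be sketched as below. The node dictionary, its weights entry, the returned message list and the report callable are illustrative assumptions; only the selection rule (lightest basic edge, else report) is taken from the slide.

```python
def find_min(node, report):
    """Sketch of Algorithm 4: test the lightest basic edge, or report if none is left."""
    basic = [q for q, s in node["status"].items() if s == "basic"]
    if basic:
        q = min(basic, key=lambda r: node["weights"][r])   # w(pq) is minimal
        node["test_node"] = q
        return [(q, ("test", node["level"], node["name"]))]
    node["test_node"] = None                                # no candidate edge left
    return report(node)


p = {"level": 1, "name": "F", "test_node": None,
     "weights": {"a": 3, "b": 1, "c": 2},
     "status": {"a": "basic", "b": "branch", "c": "basic"}}
first = find_min(p, report=lambda n: ["report called"])

p["status"].update({"a": "reject", "c": "reject"})
second = find_min(p, report=lambda n: ["report called"])
```

In the first call b is the lightest neighbor overall, but it is already a branch, so the test goes to c; once no basic edge remains, the call falls through to report.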
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Algorithm 5: Receipt of test Message 


Receive <test,level',name'> from q:
    if level' > level then
        wait
    end
    else if name = name' then
        if status[q] = basic then
            status[q] ← reject
        end
        if q ≠ testNode then
            send <reject> to q
        end
        else
            findMin()
        end
    end
    else
        send <accept> to q
    end
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So, receiving a test message again same convention I am node p I receive a test message from 
node q. This p, q stuff can be quite confusing; in fact, the first time that I read it I thought it was pretty challenging, but now I have gotten used to it, and you will too. So, I am p, I am getting a test message from q, and level' and name' are q's level and name.

If q's level > my level, then by the EQ rule and the LT rule we cannot combine anyway. So, I just wait: I keep the message in an internal buffer and I do not do anything, I just sleep on it. Else, if name = name', which means my name and q's name are the same, it basically means that the message has been sent to another node of the same fragment.

Well, then if the status is basic, I mark the status as reject. This edge clearly cannot be part of the MST, because you cannot have an edge to a node in the same fragment; that is not allowed, as it would create a cycle in the tree. So, we know for sure that this cannot be a valid tree edge. We need to reject this edge because it goes to an internal node, so I just reject it.


Then comes the check of whether q is equal to testNode. As far as I am concerned, I might have sent a test message to q myself; whenever I send a test message to some other node, I set that node as the test node. So, if q ≠ testNode, it means that I have not sent a message to q asking whether it would want to join me over this unexplored edge.

Then clearly a reject message needs to be sent to q, telling it: look, we cannot join because we are actually a part of the same fragment. But if q = testNode, which means that a test message has already been sent to q, then I should call findMin again and move on to some other node, because clearly q is not a candidate. q will also mark this edge to me as a rejected edge because it will get my test message.

So, recall, when is the test node set? It is set when a test message is sent. So, the fact that q is the test node means that a test message has been sent to q, which means that in due course of time q will mark the edge to me as reject, so I need not bother.

In this case, I know that q is a part of my fragment and q knows that I am a part of its fragment; we mutually know each other. If we do not mutually know, then I should make q explicitly aware of the fact: look, q, you should not have sent me a test message in the first place, because you and I are a part of the same fragment, so we can never have a connecting edge between us.


Hence, I am rejecting this message. As I said, regardless of how q learns it, either through my test message or through my reject, q will ultimately get to know that it is a part of the same fragment as p, which is myself, and it will mark the edge to me as rejected. Given the fact that q is now a rejected node, what I need to do is call findMin again, and if I call findMin again, the status of q will not be basic anymore.

I mean the q that we have been talking about in this slide; again, p and q are slightly confusing, but the good thing about the video is that the same slide can be seen over and over again to get the basic idea. So, in this case, status[q] = basic does not hold anymore because we just rejected it, and some other node has to be picked out of this set.


So, we will have a new minimum, and again we do the same: we send a new test message to the new minimum. Let us say I am a part of this fragment and I send the message to q. If q also happens to be a part of my fragment, unbeknown to me, then either it would have sent me a test message, with which I will get to know that q is actually part of my fragment and I will mark it as rejected, or q will send me a reject message and then I will mark it as rejected. Again I will come to this point, where I will start testing with some other neighbor of mine that satisfies this criterion, and if I do not find such a neighbor I will report this fact.

So, the key point is that if q has a higher level, I wait. Otherwise, if we are part of the same fragment, I reject the edge to q, and if q was my test node I call findMin again because I would like to explore some other neighbor of mine. Otherwise, if the names are not the same, I send an accept message to q, because there is no reason why I should not.

There is absolutely no reason why I should not: number one, there is no issue with the less than or equal to condition on the levels, and furthermore, it is the least weight edge between the two fragments.
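The case analysis for the test message can be sketched as follows. As before, the dictionary-based node, the deferred list that models "wait", the returned message list and the find_min callable are assumptions for illustration only.

```python
def on_test(node, q, level, name, find_min):
    """Sketch of Algorithm 5: node p handles <test,level',name'> from q."""
    if level > node["level"]:
        node["deferred"].append((q, ("test", level, name)))   # wait
        return []
    if name == node["name"]:
        # Same fragment: this edge would close a cycle, so it is rejected.
        if node["status"][q] == "basic":
            node["status"][q] = "reject"
        if q != node["test_node"]:
            return [(q, ("reject",))]
        # q was also my own test target; it will learn the same from my test message.
        return find_min(node)
    return [(q, ("accept",))]


p = {"level": 2, "name": "F", "test_node": "b", "deferred": [],
     "status": {"a": "basic", "b": "basic", "c": "basic"}}
stub = lambda n: ["findMin called"]
waited = on_test(p, "a", 3, "G", stub)      # sender has a higher level
rejected = on_test(p, "a", 1, "F", stub)    # same fragment, a is not my test node
retried = on_test(p, "b", 2, "F", stub)     # same fragment, b is my test node
accepted = on_test(p, "c", 1, "G", stub)    # different fragment, level is fine
```

Note that in the retried case no reject message is sent at all, matching the point made above that both sides learn the fact from each other's test messages.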
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Algorithm 6: Process accept/reject messages 


Receive <accept> from q:
    testNode ← ∅
    if w(pq) < bestWt then
        bestWt ← w(pq)
        bestNode ← q
    end
    report()    (a function call)

Receive <reject> from q:
    if status[q] = basic then
        status[q] ← reject
    end
    findMin()
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So, after I receive an accept message from q which means that q does not have an objection I 
set the test node to null. Then, if w(pq) < bestWt, the best weight that I have seen so far, I set bestWt ← w(pq) and bestNode ← q, and I report this fact.

What I report is: look, I have found something, and as far as I am concerned my best neighbor is q. Mind you, report() is not a message to another node; it is just an internal function call, and the internal function will be able to see these internal variables. And what do I do if I receive a reject from q? Well, then I change the status of the edge from basic to reject and I continue with my other neighbors.

So, what is the idea? The idea is that my state is find, so my job is to find a neighbor, and I start contacting my neighbors. I look at only the basic edges, not the branch or reject edges, and I contact them in ascending order of weight. One option that they have is to just hang on to my message and not reply to me. The other is that they can either accept or reject. If they accept, I record this fact; if they reject, I move on to some other neighbor of mine.
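The two handlers for the replies to a test message can be sketched like this. The node dictionary and the report and find_min callables are assumptions; they stand in for the internal functions of the same names on the slides.

```python
from math import inf


def on_accept(node, q, report):
    """Sketch of Algorithm 6, accept part: record pq if it beats the best so far."""
    node["test_node"] = None
    if node["weights"][q] < node["best_wt"]:
        node["best_wt"] = node["weights"][q]
        node["best_node"] = q
    return report(node)              # internal function call, not a message


def on_reject(node, q, find_min):
    """Sketch of Algorithm 6, reject part: drop this edge and try the next one."""
    if node["status"][q] == "basic":
        node["status"][q] = "reject"
    return find_min(node)


p = {"weights": {"a": 3, "b": 1}, "status": {"a": "basic", "b": "basic"},
     "best_wt": inf, "best_node": None, "test_node": "b"}
after_accept = on_accept(p, "b", report=lambda n: ["report called"])
after_reject = on_reject(p, "a", find_min=lambda n: ["findMin called"])
```

The accept path ends the node's own search, while the reject path simply moves on to the next lightest basic edge.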
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Algorithm 7: report Method 


report:
    if (rec = |{q : status[q] = branch ∧ q ≠ parent}|) ∧ (testNode = ∅) then
        state ← found
        send <report,bestWt> to parent
    end
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So, what does the report method do? It is a method it is not a message it is a method. So, what 
the report method does is that, number one, it looks at a set. This is the set of all q, where again I am p and any other node is q, such that the status of the edge to q is branch and q is not my parent. Both of these things together mean that the other node q is a child, because it is a tree neighbor of mine and it is not my parent.

So, it can only be a child, and the cardinality of this set is essentially the number of children. So, the condition is (rec = |{q : status[q] = branch ∧ q ≠ parent}|), that is, rec equals the number of children. This is similar to the leader election algorithm in trees that we have discussed, where all the children send their values to their parent and then it kind of propagates up the tree. So, this is basically saying: if I have received a message from all my children and (testNode = ∅).


So, when do I set the test node to null? When I receive an accept message, or when I find that none of my neighbors meets the criterion. Either way it means that there is no outstanding test message. So, let us just look at the test node: when I receive an accept I set it to null, and when else do I set it to null? I set it to null when I run out of neighbors. Otherwise I do not set it to null.

If I have sent a test message and it is outstanding, then it is non-null. So, the fact that testNode = ∅ basically means I have finished my job of testing. If you go back to the tree based leader election algorithm from the previous lecture, we had said that every parent essentially finds its own minimum and also aggregates all the minima sent by its children.

It is done once all of its children have sent a message, which is precisely what is being captured by this complicated looking mathematical formula over here; it simply means that all my children have responded to me. And testNode = ∅, in simple layman terms, means that I am done with my job. So, together the if statement means that my children are done with their job and I am done with my job, so I set the state to found.

And I report as an honest child to my parent: look, I am sending you the report message, and this is the best weight edge that I found. The best weight edge means that, as far as I am concerned, for my subtree this is the least weight outgoing edge. It is a valid least weight outgoing edge, and furthermore the node on the other side agrees to connect. So, I have an upfront commitment from the other node that it is not going to refuse.
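The condition inside the report method is easier to read when the set of children is computed explicitly. The following is a sketch under assumptions: the node is a plain dictionary and the message to the parent is returned rather than sent.

```python
def report(node):
    """Sketch of Algorithm 7: report upwards once the children and I are done."""
    children = [q for q, s in node["status"].items()
                if s == "branch" and q != node["parent"]]
    all_children_reported = node["rec"] == len(children)
    my_test_finished = node["test_node"] is None
    if all_children_reported and my_test_finished:
        node["state"] = "found"
        return [(node["parent"], ("report", node["best_wt"]))]
    return []                        # not ready yet; a later call will retry


p = {"status": {"par": "branch", "c1": "branch", "c2": "branch", "x": "reject"},
     "parent": "par", "rec": 1, "test_node": None, "state": "find", "best_wt": 4}
too_early = report(p)                # only one of two children has reported
p["rec"] = 2
ready = report(p)                    # both children reported, no test outstanding
```

Because report is called after every accept and after every child's report, whichever of the two conditions becomes true last triggers the message to the parent.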
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Algorithm 8: Process report Message 


Receive <report,ω> from q:
    if q ≠ parent then
        if ω < bestWt then
            bestWt ← ω
            bestNode ← q
        end
        rec ← rec + 1
        report()
    end
    else
        if state = find then
            wait
        end
        else if ω > bestWt then
            changeRoot()
        end
        else if ω = bestWt = ∞ then
            stop
        end
    end

Algorithm 7: report Method

report:
    if (rec = |{q : status[q] = branch ∧ q ≠ parent}|) ∧ (testNode = ∅) then
        state ← found
        send <report,bestWt> to parent
    end


So, then how do I process the report message? When I get a report message from q, again I am p, I am always p, and in this case q is my child. This is again similar to the leader election algorithm with trees, where a parent combines its own value with the values from its children; something similar is happening over here.

So, if q ≠ parent, that is, the sender is not my parent, which of course always holds for the root node, then if ω < bestWt, the best weight that I have seen, I set bestWt ← ω and bestNode ← q. It means that as far as I am concerned the child that is sending me this weight is the best node. So, I will record this fact, and furthermore, given that my child is sending me a message, I will set rec = rec + 1, which means that one more of my children has responded.

And then I will call the report function, which is the same one that we have already understood. It will check whether all my children have replied or not, given that I have just incremented the rec variable. I am assuming all of these variables are global within the scope of a node.

Since I have just incremented rec, it is possible that the if condition becomes true and I enter it. If it does not, then there is no problem; I just come back and keep waiting. So, this part of the code basically means that every parent waits for all of its children to report the best edges that they have found. It then computes the overall minimum and reports that to its parent.


So, this is as I said what would happen in any tree that every sub tree will report its best again 
the parent will collect everything from its children which again is a root of its own sub tree and 


it will just propagate that up, up and up and up no problem until you reach the root what is the 
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root? So, the root in this case is slightly complicated we will come to it. So, otherwise if q = 
parent in a sense I am receiving a message from my parent which means my parent is sending 


me the report message. 


It is kind of strange, but we will see when that happens. So, then we will see so in this case q 
= parent. If the state is find, if my state is find I am still finding then I wait. If omega < bestWt 
which means that my best weight is actually the best then I change the rule. Otherwise, if let 


us say omega = bestWt which means me and my parent both are reporting infinity. 


This means that we have actually reached the end: there are no eligible edges. So, the MST
condition has been met and the algorithm terminates. So, now the structure of the parent is
quite important; if you go back to the connect message, the parent pointer matters there, but I
would like to discuss change root first before going to the parent, because they are
connected.
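
The report handling described above can be sketched in Python. This is only a minimal sketch under stated assumptions, not the lecture's code: the class name `ReportNode`, the `actions` list that stands in for real message sends, and the simplified `report()` condition (it only checks that every child has replied) are all hypothetical.

```python
import math


class ReportNode:
    """Hypothetical per-node state for handling <report, w> messages."""

    def __init__(self, parent, num_children, state="found"):
        self.parent = parent              # neighbour across the parent (or core) edge
        self.num_children = num_children  # children expected to report
        self.state = state                # "find" or "found"
        self.best_wt = math.inf           # best weight seen so far
        self.best_node = None             # child through which best_wt was reported
        self.rec = 0                      # number of children that have replied
        self.actions = []                 # stands in for sending messages

    def report(self):
        # Assumption: report upwards once every child has replied.
        if self.rec == self.num_children:
            self.state = "found"
            self.actions.append(("report", self.best_wt))

    def on_report(self, q, w):
        if q != self.parent:
            # A child reports the best outgoing edge weight of its subtree.
            if w < self.best_wt:
                self.best_wt, self.best_node = w, q
            self.rec += 1
            self.report()
        elif self.state == "find":
            self.actions.append("wait")        # still searching, handle later
        elif w > self.best_wt:
            self.actions.append("changeRoot")  # our side holds the better edge
        elif w == self.best_wt == math.inf:
            self.actions.append("stop")        # no outgoing edges left


node = ReportNode(parent="p", num_children=2)
node.on_report("a", 7)   # first child reports weight 7
node.on_report("b", 3)   # second child reports weight 3
node.on_report("p", 5)   # report arriving over the parent edge
print(node.best_node, node.best_wt, node.actions)
```

The three branches for a report from the parent mirror the wait, changeRoot and stop cases discussed above.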
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Algorithm 9: changeRoot() Method

changeRoot():
    if status[bestNode] = branch then
        send changeroot to bestNode
    end
    else
        status[bestNode] ← branch
        send <connect, level> to bestNode
    end

Receive changeroot:
    changeRoot()
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Algorithm 8: Process report Message

Receive <report, ω> from q:
    if q ≠ parent then
        if ω < bestWt then
            bestWt ← ω
            bestNode ← q
        end
        rec ← rec + 1
        report()
    end
    else
        if state = find then
            wait
        end
        else if ω > bestWt then
            changeRoot()
        end
        else if ω = bestWt = ∞ then
            stop
        end
    end
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Algorithm 2: Processing of the connect message

Receive <connect, L> from q:
    if L < level then
        /* Combine with rule LT */
        status[q] ← branch
        send <initiate, level, name, state> to q
    end
    else if status[q] = basic then
        wait
    end
    else
        /* Combine with rule EQ */
        send <initiate, level+1, pq, find> to q
    end
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Algorithm 3: Processing of the initiate message

Receive <initiate, level', name', state'> from q:
    (level, name, state) ← (level', name', state')
    parent ← q
    bestNode ← ∅
    bestWt ← ∞
    testNode ← none
    foreach r ∈ neigh(p): (status[r] = branch) ∧ (r ≠ q) do
        send <initiate, level', name', state'> to r
    end
    if state = find then
        rec ← 0
        findMin()
    end
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So, let us now discuss the last algorithm, which is the change root method. So, here what
happens is that we have found out which node in the entire fragment is connected to the least
weight outgoing edge. How has this been found out? Well, basically every single node
of the subtree reported its best value up towards the root, and finally the root computed the
minimum, as you can see over here, and updated the best weight and best node.


The best node is of course one of its child nodes that has a path to the eventual best node, which
lies at the edge of the fragment. So, every node just keeps a pointer to its child, and just by
following these child pointers we ultimately reach the node that has the least weight outgoing
edge of this entire fragment. So, now when you decide to change your root, it basically means
that the fragment gets rooted at this node that is at the edge.


So, consider all the nodes along this path, regardless of how they were pointing earlier. This
is of course similar to Raymond's tree algorithm for mutual exclusion, which was there in our
lecture set. Change root essentially means that every node on the path updates its parent
pointer to point towards the new root.


So, every node over here updates its pointer along the path. So, let us say this was the
old root. All the nodes pointed to this, and the old root over here now points to the new root,
which is over here, and this is of course the edge that points to a different fragment; this is
F1 and this is F2. And clearly it is possible to reach this node from any node within the
fragment, because every node within the fragment in any case was pointing to the old root.


So, what we are doing is establishing a path from the old root to the new root just by flipping
child-parent pointers. So, we start from the root and check whether status[bestNode] = branch;
if so, we send changeroot to bestNode, and we just keep on doing that; ultimately we
will arrive over here.


Then, at this node, bestNode lies across the edge e, so we set status[bestNode] ← branch, in
the sense that at this point the status of edge e turns from basic into branch, which basically
means that now I acknowledge the fact that e is the
least weight outgoing edge out of the fragment.
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And furthermore, you will send a connect message, <connect, level>, to bestNode, which means
across the edge e to the other fragment. So, what did we do? The summary is that in the entire
fragment each of the nodes looked at its neighbours outside the fragment on undecided (basic)
edges; it started from the minimum and went up in ascending order of weight if there were any
rejects, till the other side gave an acceptance saying yes, I am willing to join.


And then all of this information was forwarded and aggregated all the way up to the root, and
after that point the decision was made on what genuinely is the best. Then subsequently the
decision was sent back down; of course the parent pointers are flipped along the way, and I am
not showing that in the slide. So, let us say this edge over here was found to
be the minimum.


So, then we set the status of this edge to branch and then we send the connect message,
where the connect message will again take us back to algorithm 2.
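
The change root walk just summarised can be sketched as follows. This is a minimal illustration, assuming a hypothetical dictionary layout in which each node stores `best_node` and a `status` map per neighbour; a loop stands in for the chain of changeroot messages, and the returned pair names the edge on which the connect message would be sent.

```python
def change_root(nodes, root):
    """Follow bestNode pointers from `root` until a non-branch edge is found.

    Returns (path, connect_edge): the nodes visited by the changeroot
    messages and the edge (p, b) over which <connect, level> would go.
    """
    path = [root]
    p = root
    # "if status[bestNode] = branch then send changeroot to bestNode"
    while nodes[p]["status"][nodes[p]["best_node"]] == "branch":
        p = nodes[p]["best_node"]
        path.append(p)
    # "else status[bestNode] <- branch; send <connect, level> to bestNode"
    b = nodes[p]["best_node"]
    nodes[p]["status"][b] = "branch"
    return path, (p, b)


# Hypothetical fragment r - a - b, where b's best edge leads outside to x.
nodes = {
    "r": {"best_node": "a", "status": {"a": "branch"}},
    "a": {"best_node": "b", "status": {"r": "branch", "b": "branch"}},
    "b": {"best_node": "x", "status": {"a": "branch", "x": "basic"}},
}
path, connect_edge = change_root(nodes, "r")
print(path, connect_edge)
```

The parent-pointer flipping mentioned in the lecture is left out here, as it is on the slide.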


What was happening in algorithm 2? We were processing the connect message, and there of course
if we found the LT condition we connected immediately. So, here, if you recall, we had kept
something open. We had asked what could happen if the LT condition does not hold. If this is
not holding, it means L ≥ level, which is fine.


This means that the level coming from the outside is greater than or equal to my level.
So, now what I do is that I look at the status of the edge. If the status of the edge is basic
then I wait. What does this mean? This means that I have not made any decision
about this edge. So, even if I have sent a test message on this edge, I have technically not
made a decision because I have not gotten any accept or reject.


And so, I have not made any decision from my side. But if it is not basic, then I come here;
this is where we had left a question open: what happens here? Now we know. So, you reach over
here if, number one, L ≥ level and, number two, the status of the edge is either branch or
reject.


So, let us look at the reject case first; I think reject is easy to eliminate. If the status
of this edge is reject then there is no reason why a connect message should have been sent on
the same edge, because the connect message is contingent on the fact that an accept was
received. Since a reject has been received, both the nodes are a part of the same fragment, so
there is no chance that a connect message will be sent on the same edge.
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So, reject is not possible. So, this means that the status of the edge has to be branch, which
means that from the point of view of both fragments it is their least weight outgoing edge.
Furthermore, L cannot be greater than level. This is not possible for a simple reason: for any
connect to happen, the sender should have gotten an accept first, and if you just
look at this line over here, whenever a test message arrives with L > level, the receiver just
waits.


So, in this case, if L > level then no reply would have been given, in the sense that an accept
message would not have been sent. This means this edge would not have been chosen and
definitely a connect message would not have been
sent along this edge. So, even L > level will not happen.


So, the only choice that we are left with is that the status is branch and furthermore
L = level; nothing else is possible. Given that, what we see is that the EQ rule
holds in this case. We were not able to say this the first time when we looked at this
algorithm, but now we can clearly see that the equality rule holds and the status of the edge
is branch.


And furthermore the levels are equal because nothing else is possible, and that is how we
initiate. So, for two fragments of the same level, let us take a
look at their joining edge. Let us call the two nodes p and p’; I do not want to use p and q
anymore. So, p would have sent a connect message to p’ and gotten an
initiate back, and p’ would have done exactly the same; given that the edge is a branch for
both, p’ would have sent a connect message to p and gotten an initiate back.


And then what you see in the initiate processing is that you set the parent to the other
node. So, you would have a parent relationship around this edge, which is also called the core
edge, that would look something like this. So, this is why I said that as far as all of these
nodes are concerned they will direct their parent pointers up here, but here you have this
cyclicity around this core edge.


So, we can think of this edge as the parent, with the two nodes here pointing to each other in
a special kind of manner; all the nodes within are pointing to basically the root and both the
roots are connected to each other in this fashion. So, now this becomes a
bigger fragment. But let us say this fragment now wants to join another fragment; then
let us say it finds a least weight outgoing edge.
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So, the least weight outgoing edge can be over here. In this case this cycle over here will break. 
So, this edge will go away and what will instead remain if I want to draw it in a slightly bigger 


canvas. 
(Refer Slide Time: 1:06:26) 


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So, what would essentially remain is something like this if let us say these are the two fragments 
initially what happened is if this is the core edge this is how they were pointed towards each 
other and if let us say this is the least weight outgoing edge then this parent pointer will break 


down and then a sequence of parent pointers will be used to come here. 
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So, this basically means that now as you can see all the nodes in this fragment are pointing here 
as this is the new root, all the nodes of this fragment are also pointing here because they were 


pointing to the old roots so now they are pointing over here. 


Similarly, if this is joining with another fragment, then depending upon what exactly the
levels are, you will have a connection that is made. So, let us say that this has
a lower level than this; then of course this will point like this and you will have a core edge
somewhere within this fragment. Otherwise, this edge over here will become the core edge and
you will see such kind of a cyclic relationship.


So, this is kind of nice, interesting and elegant, yet complex. So, what we have basically done
now is that we have looked at these special cases and we have furthermore seen what
happens at the end. What basically happens at the end is that we send the change root
messages and finally a change root happens, and gradually that is the way that your fragments
actually expand: your smaller fragments keep joining around core edges, and they keep
growing and growing until ultimately the entire graph becomes a single fragment.
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Time Complexity

Proposition 1

There are O(N log(N)) fragment name or level changes.

Message Complexity

Message Complexity: 2E + 5N log(N)

• Every edge is rejected only once → one test message and one reject message
• Total: 2E messages
• At every level, a node sends/receives at most:
    • receives: 1 initiate message
    • receives: 1 accept message
    • sends: 1 report message
    • sends: 1 changeroot/connect message
    • sends: 1 successful test message
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So, now a little bit of analysis. We claim that there are O(N log(N)) fragment name or
level changes in total. We further claim that the message complexity is 2E + 5N log(N), where E
is the number of edges and N is the number of nodes. So, what is the logic? Well, every edge is
rejected only once; it cannot be rejected more times. That costs one test message and one
reject message.


So, every edge that is not a part of the tree is rejected only once and then it is finished.
So, this is limited to 2E messages, fair enough, for E edges. At every level a node
sends or receives at most these many messages. How many? It receives two: an initiate message
to start and an accept message.


So, initiate and accept are what every node receives, and it sends a report message, a change
root or connect message depending upon where it is in the tree, and a successful test message.
So, these are the five kinds of messages that it sends or receives. Furthermore, there is no
intersection between the sending set and the receiving set. So, we can say that for every level
these are the 5 messages that every node sends or receives.


So, let us say there are N nodes; then per level we have an exchange of 5N messages.
You are welcome to verify this; it is just a question of simple bookkeeping. And
I have just one more thing to add: how many level changes will you have? We claim that anytime
a level changes, the number of nodes at least doubles. So, I would like to make a slightly
tighter claim over here. So, let us go over here.
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


So, the claim that I am making is that any fragment with level L has at
least 2^L nodes. This is trivially true if L = 0, because 2^0 = 1 and every
node by itself has level 0. So, now let us consider a mathematical
induction based proof. Let us assume that this holds till level L.


So, basically for all levels between 0 and L this holds. So, let us now
consider level L + 1. How do you go to level L + 1? You go to level L + 1 only when two
fragments of level L combine; otherwise we remain at the same level and only smaller
fragments come and keep on joining you, which is fine, because then the induction hypothesis
will still hold. So, now if two fragments of the same level L are combining, then this will
have at least 2^L nodes and this will have at least 2^L nodes.
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So, in total we will have at least 2^L + 2^L = 2^(L+1) nodes. So, again as we can see, in the
bigger fragment whose level is L + 1, the number of nodes is again at least 2^(L+1). So, the
induction step does hold. This means that the base case holds and the induction
step holds. Hence, by induction, every time I increase the level the number of nodes at
least doubles.
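
The induction argument above can be written out compactly. Here s(L), the minimum number of nodes in a fragment of level L, is notation introduced only for this sketch.

```latex
% s(L): minimum number of nodes in a fragment of level L (notation assumed here)
\begin{align*}
s(0)   &= 1 = 2^{0}, \\
s(L+1) &\ge s(L) + s(L) \ge 2^{L} + 2^{L} = 2^{L+1}, \\
2^{L}  &\le s(L) \le N \;\Longrightarrow\; L \le \log_2 N .
\end{align*}
```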


So, this further means that with N nodes the maximum number of levels that we are going to
have is log2(N). So, armed with this information, let us go back. Given the fact that
we will at most have log N level changes and per level 5N messages are sent, the
total number of such messages is 5N × log N. So, it is basically 2E + 5N log N that is the
total number of messages that we are looking at.
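
To get a feel for the numbers, the bound can be evaluated directly. The function names below are hypothetical helpers written for this sketch, and the logarithm is taken to base 2, as in the level argument.

```python
import math


def max_levels(num_nodes):
    """Largest L with 2**L <= num_nodes, i.e. floor(log2(num_nodes))."""
    return num_nodes.bit_length() - 1


def ghs_message_bound(num_nodes, num_edges):
    """Evaluate the bound 2E + 5N log2(N) from the analysis."""
    return 2 * num_edges + 5 * num_nodes * math.log2(num_nodes)


# Hypothetical network: 1024 nodes and 4096 edges.
print(max_levels(1024), ghs_message_bound(1024, 4096))
```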


So, it is essentially two times the number of edges plus 5N log N, where N is the number of
nodes. In terms of complexity this is not bad at all, in the sense that we
are able to take a very large distributed network and create an MST with N log N message
complexity, which is quite good in the space of distributed algorithms. Of course this
algorithm is complex, but now I hope that most of it is well understood.
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2] Distributed Snapshots 
• Chandy-Lamport Algorithm
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So, now we will discuss something called the Chandy-Lamport algorithm in one or two slides. It
computes what is called a distributed snapshot. So, the idea is: fine, I created a large
distributed algorithm, so what?
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Chandy-Lamport Algorithm
Distributed Snapshots

Overview

• Every process can take a local snapshot.
• The process does not process any message while taking a snapshot.

Consistent Snapshot

If a message receive event is part of a local snapshot, then its
send event should also be part of a snapshot.

• The channels are FIFO.
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So, the idea is that in a large distributed system, if let us say we have a large number of
processes, debugging this entire system is difficult because they are sending so many messages.
So, let us say we give students a homework to implement the MST, the minimum spanning tree
algorithm. Even then debugging it is hard. So, what we do is that every process takes
a local snapshot of its state.


So, a snapshot basically means I want to capture a photograph of the entire system such that if
anything is wrong with it I can analyze the snapshot and find out what is wrong. But even
taking a photograph of a distributed system where there is no clock synchrony is hard. So, we
are looking at one way of doing it. As I said, the algorithm is that every process takes a
local snapshot.


Furthermore, the process does not process any message (of course, the two uses of the word
process are different here); it does not act on any message while taking a snapshot. So, what
we want is a consistent snapshot of the entire distributed system such that we can act on it.
What this basically means is the following. Say there is a sender and there is a receiver, and
the sender sends a message.


It should never be the case that the receiver takes a snapshot of its state in which it records
the message receive, but in the sender’s snapshot the message send is not there. So, if the
receive event is there, the send event should be there; that
is the only requirement for consistency. There is no other requirement per se. So, let us say
that in a distributed system we were to take such kind of a snapshot.
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It would at least give us some kind of a photograph which of course is not instantaneous, but
might give us enough information to debug and find the source of a problem. So, let us do that;
we will use the Chandy-Lamport algorithm, which is very simple. The only assumption it
makes is that we have FIFO channels. FIFO is first in, first out, in the sense that if A sends
messages to B, the messages are not reordered.
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Algorithm

Algorithm 10: Chandy Lamport Algorithm

Initialize:
    take local snapshot
    taken ← true
    foreach q ∈ neigh(p) do
        send <mkr> to q
    end

Receive <mkr>:
    if taken = false then
        take local snapshot
        taken ← true
        foreach q ∈ neigh(p) do
            send <mkr> to q
        end
    end
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So, the algorithm is like this: we take a local snapshot and set taken to true. Then we send a
marker to each of our neighbors. So, then what happens when a
marker is received? If taken is equal to false, that is, if a snapshot has not been taken, then
the neighbor takes a local snapshot and sets taken to true. Again, it sends the
marker to each of its neighbors. So, let us say I have a system like this.


So, let us say I take a snapshot and send a marker to these three nodes. Each of these nodes
then takes a snapshot and sends a marker; let us say the markers are sent here, here and here,
but this marker finds that this node has already taken a snapshot, so the marker is
ignored. So, you take a snapshot only once, when you get the marker for the first time, and
after the snapshot is done you do not do anything for later markers.


You just record the state and that is it; either you can stop there or you can wait for
another message to ask you to resume. That is a separate matter; we will get into that slightly
later.
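
The marker flooding described above can be simulated in a few lines. This is a sketch under stated assumptions rather than a real implementation: a single global FIFO queue models message delivery (which in particular keeps every individual channel FIFO), the local snapshot is reduced to recording the order in which nodes take it, and the function name and graph are hypothetical.

```python
from collections import deque


def chandy_lamport(neigh, initiator):
    """Flood markers from `initiator`; return (snapshot order, markers delivered)."""
    taken = {p: False for p in neigh}
    order = []
    in_flight = deque()            # (sender, receiver) markers, delivered FIFO

    def take_snapshot(p):
        taken[p] = True            # "taken <- true"
        order.append(p)            # stands in for "take local snapshot"
        for q in neigh[p]:
            in_flight.append((p, q))   # "send <mkr> to q"

    take_snapshot(initiator)       # the Initialize step
    delivered = 0
    while in_flight:
        _, q = in_flight.popleft()
        delivered += 1
        if not taken[q]:           # only the first marker triggers a snapshot
            take_snapshot(q)
    return order, delivered


# Hypothetical triangle network.
neigh = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"]}
order, delivered = chandy_lamport(neigh, "a")
print(order, delivered)
```

Every node sends exactly one marker per neighbour, so the loop ends after a number of deliveries equal to the sum of the node degrees, which illustrates why the algorithm finishes in finite time.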
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Chandy-Lamport Algorithm
Distributed Snapshots

Analysis

The algorithm terminates in finite time.

If a message (p → q) is sent after a local snapshot, then it is not
a part of the receiver's (q) snapshot.
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So, theorem 1 is that the algorithm terminates in finite time. Why? Because you are sending
messages, ultimately a marker will reach everybody, and since every node takes a snapshot
when it receives the first marker message, it will terminate. The most important is theorem 2:
regardless of the distributed system, if I do this, what I am saying is that the entire
snapshot is consistent.


What does the entire snapshot consist of? It consists of the individual
snapshots of all the nodes. And why do I say it is consistent? I claim that if a receive has
been logged in the snapshot, its corresponding send has also been logged; that is my only
requirement, no other requirement. So, the theorem is that if a message from p to q is sent
after p's local snapshot, then it is not a part of the receiver's snapshot.


So, let us say there is node p and there is q. Let us say p takes a local snapshot and then it
sends a message. So, in this case we do not stall after taking a snapshot, but we send a
message. In this case the send has not been logged because I take a snapshot first and then I
send, so I did not log the send.


So, what I claim is that at the receiver the receive will also not be logged, and the
proof is very simple. When I take a snapshot I immediately send a marker message to q. After
the marker message I send my other message m. Since the channel is FIFO,
this means that q either gets the marker from me or from somebody else before me and takes a
local snapshot, and only after that does it get the message m; by that time q has taken its
snapshot.
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And it has not recorded the receive; consequently this is correct and consistent as per our
definition. Given the fact that we did not log a send, we did not log the receive either. This
was our simple definition of consistency, and as you can see this simple algorithm does satisfy
our definition of consistency. What is it? If a receive has been logged in the snapshot, that
implies the send has also been logged.
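
This definition of consistency is easy to check mechanically once the local snapshots have been collected. The sketch below assumes a hypothetical snapshot format in which every process logs the identifiers of the messages it has sent and received; the function name is likewise made up for illustration.

```python
def is_consistent(snapshots):
    """Return True if every logged receive has its send logged somewhere.

    snapshots: {process: {"sent": set_of_ids, "received": set_of_ids}}
    """
    sent = set().union(*(s["sent"] for s in snapshots.values()))
    received = set().union(*(s["received"] for s in snapshots.values()))
    return received <= sent      # subset test: receive logged implies send logged


# p logged the send of m1; q logged its receive: consistent.
good = {
    "p": {"sent": {"m1"}, "received": set()},
    "q": {"sent": set(), "received": {"m1"}},
}
# q logged the receive of m2, but nobody logged sending it: inconsistent.
bad = {
    "p": {"sent": {"m1"}, "received": set()},
    "q": {"sent": set(), "received": {"m1", "m2"}},
}
print(is_consistent(good), is_consistent(bad))
```

A script of this kind is one way the recorded snapshots could later be analyzed offline.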


So, what we were actually able to prove was the contrapositive of this consistency condition:
if a send has not been logged then the receive has also not been logged. The contrapositive is
equivalent to the original statement: a implies b is
essentially the same as not b implies not a. So, this is essentially a proof by contrapositive.
So, this is what we have done, and so we have one algorithm for at least recording a consistent
snapshot in a distributed system. Why are we introducing this here?
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Well, the reason we are introducing this is simple: the MST algorithm was
complicated, no doubt. Given that, when we actually code it
on a distributed system, there will be many corner cases and the code is not going to work, and
then this entire p, q business is going to confuse you. Furthermore, I will just
show you one of the slides which makes our life tough actually.


So, it is essentially slides like these. So, for example, here we wait. The moment we
have a wait, if the code is not written correctly you might be waiting forever; it might be an
infinite wait, and so these waits are essentially what make us quite jittery. So, these are
things that we do not like. See, here also there is one more wait; we do not like these things.
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So, given the fact that we do not like these things, what will happen is that in most cases the
code will not complete because the processes will just end up waiting because of some bug
somewhere. So, to debug such systems and find out what exactly has gone wrong, we can
create a nice summary of all the actions that a given node has taken. We will call it the
snapshot of the node and use the Chandy-Lamport algorithm to record a consistent snapshot,
consistent as per our definition.


And then this snapshot can be written to, maybe, the disk by all the processes; subsequently
it can be analyzed either manually or via a script to find out what was the most likely cause
of the error. That is the reason why this lecture combines a complicated algorithm with a very
short, small and cute algorithm to effectively debug a distributed system, because debugging a
distributed system is hard.


Now coming to the references, the book by Gerard Tel has many of these details, and there
are many other distributed algorithms as well in this book, including a lot of concepts. So,
you are most welcome to read it and also implement the MST algorithm to get a practical feel.
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Welcome to the lecture on the famous FLP result. So, the credit for the FLP result goes to
three people, and I will tell you in a second what the result is. But first we will give some
honour to the people who have arguably discovered the most important result in Distributed
Systems. So, the people who deserve the credit are Michael J. Fischer, which explains the F,
Nancy Lynch for L and Michael S. Paterson for P.


So, the FLP result basically says that it is not possible to solve the consensus problem in a
distributed setting even if we have just one faulty process. The paper is titled Impossibility
of Distributed Consensus with One Faulty Process. So, we will discuss in a second what exactly
distributed consensus is and why exactly we are so bothered about it. So, let me first discuss
what is consensus and then what is distributed consensus.


So, consensus basically means that there is a set of processes and each process proposes a
value. So, let us say that they propose an integer. So, each of these circles is a
process; one of them proposes 4, then 9, let us say 3, 2 and 5. So, consensus basically has
two properties. The first property is that the decided value is one among the proposed values.
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So, one value among all the proposed values is chosen. It cannot be a separate value; one
among them is chosen. The second one is that everybody agrees. So, this can be
thought of as an agreement problem as well, something that everybody agrees on.
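
The two properties just stated can be written as a small check. This is a sketch with hypothetical names: `proposed` maps each process to the value it proposes, and `decided` maps each process to the value it finally decides.

```python
def satisfies_consensus(proposed, decided):
    """Check the two properties: chosen value was proposed, and everybody agrees."""
    values = set(decided.values())
    agreement = len(values) == 1                         # everybody agrees
    validity = values <= set(proposed.values())          # one among the proposed values
    return agreement and validity


# Five processes proposing 4, 9, 3, 2 and 5, as in the example above.
proposed = {"p1": 4, "p2": 9, "p3": 3, "p4": 2, "p5": 5}
all_nine = {p: 9 for p in proposed}      # everybody decides 9, which was proposed
all_seven = {p: 7 for p in proposed}     # everybody agrees, but 7 was never proposed
split = dict(all_nine, p5=3)             # proposed values only, but no agreement
print(satisfies_consensus(proposed, all_nine),
      satisfies_consensus(proposed, all_seven),
      satisfies_consensus(proposed, split))
```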


So, why is this problem so important? Well, the problem is important basically because if we
consider each process to be a distributed process and we consider all of this to be a
distributed system, then what it basically means is that we want to make a set of distributed
processes agree on something of common interest, and we will see in a second what
that can be. But basically it is an agreement problem where there is no centralized entity. All
that we can do is send messages between processes, the way that we have been doing.


And we also know that messages may get delayed, and that too indefinitely, for a
very long time. So, in such a noisy environment, we want to ensure that consensus holds, which
means one among the proposed values is chosen and everybody agrees. So, why is this such a
big deal?


(Refer Slide Time: 04:09) 


So let us look at two examples of why this is such a big deal. So, let us go back to one of our 
problems, which was leader election. So, leader election, what were we doing? We had a set 
of nodes. Each node said that it is the leader, or at least wanted to be the leader. Finally, we 


had an election between them basically by sending messages between the nodes. 


And ultimately one node was chosen as the leader. And everybody else agreed that this node 
should be their leader. So, this is in a sense exactly the consensus problem, that one among the 


proposed values is chosen and everybody agrees. So, as you can see, there is complete 
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agreement. So, agreement is there no doubt. Furthermore, one among the values is chosen and 


leader election is definitely an instance, a specialization of the general consensus problem. 


So, let me give another example. So, let us say that I am the user and I make a credit card
transaction to buy an airline ticket. So, that automatically involves the bank as well. That
automatically involves the travel agency, the site that I am using to buy the ticket. And that
involves the airline as well.


So, the point is that all of them, all these five entities, need to agree on one thing: either
all five of them agree to issue the ticket or none of them does. This basically means that if
my money is deducted, then the ticket should be issued and I should have the ticket with me. It
should never be the case that money has been deducted from my bank account or my credit card
account, but my ticket is not there with me.


And furthermore, it should also not be the case that all five of these entities, which means
me, the credit card, the bank, the travel agency, and the airline, agree that I should be
given the ticket, but ultimately it is not issued.


That is also not allowed because a consensus is to choose one value among the set of proposed 
values. You cannot parachute a value from somewhere else. So, which means that, a broad 
agreement, consensus like agreement is required. So, these are also called transaction commit 
systems, which we will study later, which form the core technology of most banking, finance, 
almost everything to do with Distributed Systems. This basic transaction processing system, 


even in large databases is required. 


Again, what is the core problem? The core problem is a consensus problem. Similarly, it has 
been identified that a large number of problems in distributed systems and concurrent systems 
are essentially specializations of the mother problem, which is the consensus problem. As a 
result, the consensus problem has a very special place. 


Now, when we want to achieve consensus, some sort of an agreement, in a real-world scenario, 
we will have faults: network links may die, nodes may die, messages may be delayed 
indefinitely, or machines may simply be slow. We might have a machine which, for some reason, 
is congested and is taking too long to reply. 


So, in such a scenario, the question is: can we achieve consensus? Can we achieve an 
agreement? Can the nodes decide on something? The answer is an unfortunate one. We will 
formally define what a faulty process is, but the idea is that even if you have just one 
faulty process, the answer is no, it is not possible. 


And this is the famous FLP result (Fischer, Lynch and Paterson), which is kind of the 
underpinning of most consensus systems as of today. And the sad part is that consensus is 
still not achievable in this setting. Nevertheless, a lot is done to get close, which we will 
discuss in subsequent lectures, but we are starting from this somber note over here. 


So, given that we have described the consensus problem, let us now describe our system 
model: under what assumptions are we trying to solve the consensus problem? Of course, we 
assume that we have N processes. Any pair of processes can communicate, so there is a channel 
between every pair. 


So, we do assume reliable message delivery, in the sense that messages are delivered correctly 
and their contents are not modified; it is just that messages may get indefinitely delayed. 
Nevertheless, no message is ever dropped, unless the node is faulty, and we will come to that. 
So, among the processes in our system, we have regular processes, that is, normal or 
non-faulty processes. 


So, a non-faulty process can take an infinite number of steps, in the sense that as long as 
messages keep coming, it can process them. As compared to that, a faulty process takes a 
finite number of steps and then simply stops responding. You can say that it is slow or that 
it is dead; we have no way of differentiating between a slow process and a process which has 
failed. 


So, as far as we are concerned, the process is slow, but whether it is slow or whether it has 
actually failed, that we do not know. Then of course there is another thing about the channel, 
the network: we can have out-of-order delivery, in the sense that we do not have the FIFO 
property. 


So, we have out-of-order delivery of messages, in the sense that a message that was sent 
earlier could be delayed for a long time and arrive after a later one. That is an assumption 
that we make. So, these are the standard assumptions: we have N processes, at most one of 
which is faulty, the rest being non-faulty. We do have reliable message delivery, in the sense 
that messages are not corrupted; it is just that they can get delayed. 


And for a certain process, the faulty one, we cannot say whether it is slow or dead. The same 
holds for the others as well: there is no time base, so it is not possible for us to say 
whether a process is just slow to respond or dead. Given that, we should now look at the 
process model. The process model in a distributed system looks something like this: every 
process gets a set of inputs, and then it produces outputs. 


So, it has an internal state. As far as we are concerned, the process is kind of like a finite 
state machine. The finite state machine takes in a message, changes its internal state, and 
then, in response, it can send messages to a few other nodes. So, the outputs are basically 
output messages that are sent to other nodes. 


Then the process goes back to sleep. So, it is basically a finite state machine that is woken 
up by a message; in response, the internal state changes and other messages are sent out to 
other processes. That is the model that we have. And then of course we define the notion of a 
step in such a process. 


So, the notion of a step basically means that in a single step we can either receive a message 
from the network, in which case we assume that the state is updated instantaneously, or we can 
send an arbitrary number of messages. 


So, when messages are being sent, the atomic broadcast capability is assumed. Atomic broadcast 
basically says that if one non-faulty process receives the message, then all non-faulty 
processes will ultimately receive the message. We are talking of eventual delivery of the 
messages; we are not saying that they will be delivered by a certain time. 


So, messages, needless to say, can be delayed. Furthermore, these delays can be arbitrarily 
long, and they can cause the entire algorithm to wait indefinitely, in the sense of waiting 
for a very long period. 


So, as we have seen, message delivery times, and even node response times, are not 
deterministic and are not bounded. Here, as I said, out of N processes, one of them will be 
faulty. Faulty basically means that within the scope of the algorithm, or let us say the 
protocol, for all possible inputs, it will only take a finite number of steps, and at one 
point it will stop. It will just not take steps. To take a step basically means either 
receiving a message or sending messages. 


So, receiving a message automatically includes processing it and changing the internal state. 
That is the reason I am not talking about changing the internal state as a separate step; 
essentially, a step is receiving a message from the network, from some other node, or sending 
messages. Both of these things might get delayed, and that is the reason a node may appear to 
be unresponsive to the rest of the system. 


So, given that, let us now look at our consensus protocol. For every process in the consensus 
protocol, this is the model that we assume. Its input is a single-bit register x_p. The 
single-bit input can be either 0 or 1, which corresponds to the value that is being proposed. 
Then of course we have some internal state, which can be as large as you want. And then there 
is an output. 


So, in this case, the output is the final output; it is not the messages that we send. The 
final output is what we have decided. So, this is what a process does as a part of the 
consensus algorithm: it takes in an input, sends as many messages as it wants, and finally 
produces an output. The output has three possible values, the first one being the most 
interesting: b, which means undecided. 


So, undecided means that I have not decided on 0 or 1; as far as I am concerned, consensus has 
not been reached. 0 and 1 are the regular 0 and 1, and they mean that a consensus has been 
reached. What we want to show is that if there is one faulty process, which takes a finite 
number of steps, as we have defined, and then appears to fail, then all the processes, at 
least the non-faulty ones, can remain undecided, in the sense that they will not be able to 
make a decision. 


So, the other thing is that the output is a write-once register, in the sense that once a 
value is written, we cannot update it. Now, given this process model, where we know what the 
inputs are and what the possible outputs are, let us look at how exactly processes will 
communicate. Processes communicate by calling two functions. One is send(p, m), where p is a 
process ID and m is a message. 


And similarly, we have receive(p), the other function that processes can call. This basically 
means that process p receives a message; again, p is the process ID. So, these are the two 
basic functions that processes will call. After receiving, we get what is called an event: an 
event is basically a process ID and a message. 


So, it is a (process ID, message) pair. This event is applied to the internal state of the 
process; based on it, the internal state changes to a new internal state. Furthermore, this 
can lead to messages being sent. 
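
To make the process model a little more concrete, here is a minimal sketch in Python. It is 
only an illustration under stated assumptions: the class name Process, the field names, and 
the toy "acknowledge everything" transition are invented for this sketch and are not part of 
the lecture. An event is taken to be a pair (p, m) where p is the process that receives m. 

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Event = Tuple[int, str]  # (receiving process ID, message)


@dataclass
class Process:
    pid: int
    x: int                                         # one-bit input register x_p
    seen: List[str] = field(default_factory=list)  # internal state (toy: a message log)
    y: Optional[int] = None                        # output register; None means undecided (b)

    def decide(self, value: int) -> None:
        """Write the output register; it is write-once."""
        if value not in (0, 1):
            raise ValueError("decision must be 0 or 1")
        if self.y is not None:
            raise RuntimeError("output register is write-once")
        self.y = value

    def step(self, message: str, peers: List[int]) -> List[Event]:
        """One step: receive a message, update the state, send messages to peers.

        The transition used here (log the message, acknowledge to everyone else)
        is a placeholder, not a consensus protocol.
        """
        self.seen.append(message)
        return [(q, f"ack:{message}") for q in peers if q != self.pid]
```

The point of the sketch is only the shape of the model: a one-bit input, an arbitrary internal 
state, an output that starts undecided and can be written once, and a step that is triggered 
by a message and may produce further messages. 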


So, given that we have seen this, we are now in a position to define something called the 
configuration. The configuration is a key concept in our proof; it is one of the most 
important notions that we will use. A configuration is a configuration of the entire system. 
It essentially contains the union of the internal states of all the processes, including 
their outputs, along with all the messages in flight. 


So, this to us is a configuration. As we have seen, every process has an input, an output and 
an internal state. If I take a union of these across all the processes, that would include 
their outputs (whether they have decided something or not), their internal states, and 
whatever messages are currently in flight and are yet to be delivered. We will call this a 
configuration. 


So, an event is essentially a message along with a process ID. Every time an event is applied 
to a node, which means that the message is delivered to the node, the node will of course 
update its internal state and send messages. This event will also cause the global 
configuration to change, the way that we have defined it. 


So, when I look at the global configuration, the configuration stays as it is until, let us 
say, a message is delivered. This delivery is an event e. I apply this event to the existing 
configuration C to get a new configuration C’, that is, C’ = e(C). This is known as applying 
an event. If I apply several events, in the sense that I present these events to the processes 
one after the other, then I get a chain of configurations C, C1, C2, and so on. 


And finally, we will have some configuration, let us say Cn. The important thing is that when 
I apply a series of events in sequence, we call this an event schedule: a schedule is 
essentially a finite sequence of events. We can denote this sequence by Sigma, and we can 
just say Sigma(C) = Cn. So, Sigma is a sequence of events that is applied to a configuration 
to get a new configuration. 
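
Applying events and schedules to a configuration can be sketched as follows. This is a toy 
model under stated assumptions: the names Configuration, apply_event and apply_schedule are 
invented here, a process's internal state is reduced to the tuple of messages it has 
consumed, outputs are left out for brevity, and a step sends no new messages. 

```python
from dataclasses import dataclass
from typing import List, Tuple

Event = Tuple[int, str]  # (receiving process ID, message)


@dataclass(frozen=True)
class Configuration:
    states: Tuple[Tuple[str, ...], ...]  # states[p] = messages process p has consumed
    in_flight: Tuple[Event, ...]         # undelivered messages, kept sorted


def apply_event(c: Configuration, e: Event) -> Configuration:
    """Return e(C): deliver one in-flight message to its destination process."""
    if e not in c.in_flight:
        raise ValueError(f"event {e} is not applicable")
    p, m = e
    flight = list(c.in_flight)
    flight.remove(e)                     # the message leaves the network
    states = list(c.states)
    states[p] = states[p] + (m,)         # toy transition: remember the message
    return Configuration(tuple(states), tuple(sorted(flight)))


def apply_schedule(c: Configuration, sigma: List[Event]) -> Configuration:
    """Return Sigma(C): apply a finite sequence of events one after the other."""
    for e in sigma:
        c = apply_event(c, e)
    return c
```

For example, starting from a configuration with two idle processes and two messages in 
flight, applying the schedule [(0, "a"), (1, "b")] yields a configuration in which process 0 
has consumed "a", process 1 has consumed "b", and nothing is left in flight. 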


So, as we can see, given a configuration, for different combinations of events we can arrive 
at all kinds of new configurations. All of the configurations that can be arrived at from C 
are said to be reachable configurations. Alternatively, they are also called accessible 
configurations. 


So, this basically means that I take one configuration C of the entire system and apply all 
kinds of events to it; whatever I can reach is my reachable set, or accessible set. Given 
this, we are pretty much in a position to define our final correctness criteria. So, what have 
we looked at up until now? If I were to summarize, we have looked at processes: specifically, 
their one-bit input and their output, which also includes the undecided state, and that is 
very important. 


We have also looked at the internal state and, of course, the combination of all of these, 
essentially a union across all the processes, which gives a configuration. And we have seen 
that if you apply an event to a configuration, it becomes another configuration; if you apply 
an event sequence Sigma to C, it becomes some other configuration C’. 


So, now we are in a position, as I said, to provide some definitions that will take us to our 
final notion of correctness. The first thing that I would like to say is that a configuration 
C has a decision value v if some process has already made its decision. Let us say that within 
the configuration there is some process p whose output is already 0 or 1: it is not undecided, 
and it has finalized its output. 


So, a finalized output means a result of the consensus. If this has been finalized, we say 
that the configuration C has decided, and whatever output any one process has decided will be 
the decision value of that configuration as well. And clearly, multiple processes cannot 
decide different things. 


So, now let us use this fact to define what is called a partially correct execution. A 
partially correct execution has two properties. The first property is that no accessible 
configuration has more than one decision value, which means that if I start from an initial 
state, no configuration that I can reach has more than one decision value. This is something 
that we just said. 


So, what we basically said is that if a configuration has more than one decision value, then 
there is clearly no consensus, so that is not allowed. This is property 1. The second 
property is that if I start from an initial configuration, then there is some accessible 
configuration that has decision value 0, and there is some accessible configuration that has 
decision value 1. 


So, this basically means that it is not a fixed match: when I am starting, some configuration 
is reachable which decides 0, and some configuration is reachable which decides 1. I am 
starting with an open mind; I am not committing myself to any particular outcome. I am simply 
saying that this is where I start, and this is my initial value. 


Both outcomes are accessible, in the sense that a path of events exists that can take me from 
an initial configuration to some configuration, let us say Cx, which decides 0, and a path 
exists to some other configuration, let us say Cy, which decides 1. In a sense, both are 
possible. 
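
The two properties of partial correctness can be written compactly as below. The notation is 
an assumption made for this summary and does not come from the lecture: $\mathcal{A}$ is the 
set of configurations accessible from the initial configurations, and 
$\mathrm{dv}(C) \subseteq \{0, 1\}$ is the set of decision values present in configuration 
$C$. 

```latex
\begin{align*}
\text{(1)}\quad & \forall C \in \mathcal{A}:\ |\mathrm{dv}(C)| \le 1 \\
\text{(2)}\quad & \forall v \in \{0,1\}\ \ \exists C \in \mathcal{A}:\ v \in \mathrm{dv}(C)
\end{align*}
```

Property (1) is the agreement condition, and property (2) is the "not a fixed match" 
condition. 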


So, the first property is a clear-cut consensus condition: multiple values cannot be decided. 
The second one is more like a sanity check that it is not a fixed match: when you begin, you 
start with an open mind, and you can decide either 0 or 1. Given that we have this, we can 
extend a partially correct execution to what is called a totally correct execution, which is 
what we are after. 


(Refer Slide Time: 29:24) 


Let me write it down. A totally correct execution takes in the notion of a faulty process and 
a non-faulty process. As we said, a process is faulty if it takes a finite number of steps, in 
the sense that it does something, something, something, and then abruptly stops, or let us say 
stops for an infinite amount of time, or for such a large amount of time that everybody else 
concludes that it is virtually dead. 


We will see what exactly this means in the context of our proof. But essentially the idea is 
that it takes steps a finite number of times and then it stops. Whereas a non-faulty process 
takes a step every time you send it an event, in the sense that it receives it and then sends 
additional messages or something. So, it takes a step the way that we have defined. 


Furthermore, we define the notion of an admissible run. A run is basically a sequence of 
configurations, where we keep presenting messages and move from configuration A to B, to C, 
to D, and so on. A run is the same as, let us say, the run of a program: you start the 
protocol and the protocol runs. The way the protocol runs is that it jumps from one 
configuration to the next as we generate more events; the nodes consume them and change their 
internal states, and because messages are being sent and received, we keep moving from 
configuration to configuration. This is called a run. 


An admissible run is one in which all non-faulty nodes eventually get every message sent to 
them. Of course, a message may be delayed, but eventually all non-faulty nodes will get it; 
for them, no message is dropped. So, what is a totally correct protocol? Totally correct is 
essentially partially correct plus a little bit more. Partially correct means that the 
protocol makes sense, in the sense that consensus is achieved as we have seen over here. 


And furthermore, we start with an open mind, so we do not commit ourselves to one outcome, 0 
or 1. That is, let us say, the first clause. The second clause is that every admissible run is 
a deciding run. An admissible run is essentially a run of the protocol where all non-faulty 
nodes get their messages; for them, no message is dropped. 


What this basically means is that for every admissible run of the protocol, we finally reach 
a configuration that decides. Whether it decides 0 or 1, we do not really care, but finally we 
do reach a configuration that decides the outcome, whatever it may be. 


So, in the presence of one faulty process, this is what a totally correct execution means to 
us: the basic principles of consensus hold, the basic sanity checks hold, and furthermore 
every admissible run is a deciding run, in the sense that every time we run the protocol, we 
ultimately end up with a decision. What we would like to prove is that our protocol is not 
totally correct, in the sense that not every admissible run is a deciding run, which means 
that there are runs in which, even after an infinite number of steps, we will not be able to 
decide. 


So, our aim in this proof is basically to prove that our protocol is not totally correct. If 
something is totally correct, it means that it decides on every single run; it is totally 
decisive. We want to prove the reverse, that our protocol can be indecisive. That is 
essentially what we would like to prove. 


So, for this, we will prove a sequence of three lemmas, which will say that if you start from 
an indecisive state, then regardless of how many messages you exchange, you can always remain 
in an indecisive state. It is possible that for certain runs, regardless of how long you run, 
you will never be able to make a decision. So, you will never be able to compute a consensus 
in the presence of one faulty process, which basically means that the protocol is not totally 
correct, because in this case not every admissible run is a deciding run. Ultimately, we are 
not deciding anything. 


So, to prove this, we will essentially divide the proof into three lemmas: Lemma 1, Lemma 2, 
and Lemma 3. You can take a look at the paper. This is exactly what we are going to do: we 
look at two simple lemmas and then a third one. The first lemma is fairly straightforward. 
Here is what it says. Let us start from some configuration C and apply a schedule of events: 
let us apply Sigma 1, or alternatively let us apply Sigma 2. 


Let these lead to configurations C1 and C2, respectively. Furthermore, let the set of 
processes that take steps in Sigma 1 and the set of processes that take steps in Sigma 2 be 
mutually disjoint. If P(Sigma) denotes the set of processes that take steps in Sigma, this 
says $P(\sigma_1) \cap P(\sigma_2) = \emptyset$. Essentially, Sigma 1 works on one set of 
processes and Sigma 2 works on a different set of processes. If that is the case, then we can 
apply Sigma 2 to C1 and we can apply Sigma 1 to C2, and they will ultimately reach the same 
configuration C3. 


As far as C3 is concerned, the relative order of Sigma 1 and Sigma 2 does not matter, because 
the processes themselves are disjoint; there is no exchange of information between them. That 
is the reason this appears to be commutative: you either apply Sigma 1 first and then Sigma 
2, or you apply Sigma 2 first and then Sigma 1. It does not matter; you ultimately reach the 
same point. 


So, disjoint processes in a sense imply commutativity of operations: the operations commute, 
and their order can be swapped. We will remember this; it is a very important result and we 
will take it forward. 
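
The commutativity in Lemma 1 can be checked on a small example. The sketch below is a toy 
model and not the lecture's construction: a configuration is a pair (per-process message 
logs, sorted in-flight messages), a step merely logs the delivered message, and the names 
apply_schedule and processes are invented for this illustration. 

```python
from typing import List, Set, Tuple

Event = Tuple[int, str]  # (receiving process ID, message)
Config = Tuple[Tuple[Tuple[str, ...], ...], Tuple[Event, ...]]


def apply_schedule(config: Config, sigma: List[Event]) -> Config:
    """Apply the events of sigma in order and return the resulting configuration."""
    states = [list(s) for s in config[0]]
    in_flight = list(config[1])
    for p, m in sigma:
        in_flight.remove((p, m))  # raises ValueError if the event is not applicable
        states[p].append(m)       # toy step: process p just logs the message
    return tuple(tuple(s) for s in states), tuple(sorted(in_flight))


def processes(sigma: List[Event]) -> Set[int]:
    """P(sigma): the set of processes that take steps in sigma."""
    return {p for p, _ in sigma}


# Three processes, four messages in flight.
C = (((), (), ()), ((0, "a"), (0, "b"), (1, "c"), (2, "d")))
sigma1 = [(0, "a"), (0, "b")]  # only process 0 takes steps
sigma2 = [(1, "c")]            # only process 1 takes steps

C1 = apply_schedule(C, sigma1)
C2 = apply_schedule(C, sigma2)
C3_via_C1 = apply_schedule(C1, sigma2)  # Sigma 1 first, then Sigma 2
C3_via_C2 = apply_schedule(C2, sigma1)  # Sigma 2 first, then Sigma 1
```

Since the two schedules touch disjoint sets of processes, both orders end in the same 
configuration C3. The toy step sends no messages, but the same reasoning applies when steps 
do send messages, because what a process sends depends only on its own state and the message 
it receives. 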


(Refer Slide Time: 37:58) 
So, now let us come to the second lemma, Lemma 2. Prior to describing the lemma, let me 
introduce the definition of one more term. It is a simple term. Let us say that C is a 
configuration, and from this configuration we can reach a configuration which decides 0 and 
we can also reach a configuration that decides 1, in the sense that this is an open-minded 
configuration. We call this a bivalent configuration. 


So, this means that starting from here, I can reach either a state that decides 0 or one that 
decides 1. This is called bivalent. In comparison, we can define a univalent configuration 
where, regardless of the configuration transitions that you make, it is guaranteed that you 
will always decide 0 (a 0-valent configuration) or always decide 1 (a 1-valent 
configuration), in the sense that the outcome has already been determined and it is not going 
to change. That is a univalent configuration. 
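
On a small, explicitly given transition graph, the valence of a configuration can be computed 
by collecting every decision value that is reachable from it. This is only a sketch: the 
function names, the dictionary encoding of the graph, and the labels "A0", "B1" and so on are 
made up for the illustration, and a real protocol would typically have far too many 
configurations to enumerate like this. 

```python
from typing import Dict, Hashable, List, Set


def reachable_decisions(start: Hashable,
                        successors: Dict[Hashable, List[Hashable]],
                        decision: Dict[Hashable, int]) -> Set[int]:
    """Collect the decision values of all configurations reachable from start."""
    seen, stack, values = {start}, [start], set()
    while stack:
        c = stack.pop()
        if c in decision:
            values.add(decision[c])
        for nxt in successors.get(c, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return values


def valence(start: Hashable,
            successors: Dict[Hashable, List[Hashable]],
            decision: Dict[Hashable, int]) -> str:
    """Classify a configuration by the set of decision values reachable from it."""
    values = reachable_decisions(start, successors, decision)
    if values == {0, 1}:
        return "bivalent"
    if values == {0}:
        return "0-valent"
    if values == {1}:
        return "1-valent"
    return "no decision reachable"


# Toy graph: from C we can move to A or B; A leads to a 0-decision, B to a 1-decision.
successors = {"C": ["A", "B"], "A": ["A0"], "B": ["B1"]}
decision = {"A0": 0, "B1": 1}
```

Here C is bivalent, because both a 0-deciding and a 1-deciding configuration are reachable 
from it, whereas A is 0-valent and B is 1-valent. 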


So, Lemma 2 basically says that for any protocol, a bivalent initial configuration exists. 
Let us assume to the contrary that this is not the case. If this is not the case, what it 
means, from the definition of partial correctness, is that all the initial states are either 
0-valent or 1-valent. I am using state and configuration interchangeably. 


So, when you are just starting the protocol, what we said from partial correctness is that it 
should be possible in some runs to decide 0 and in some runs to decide 1; in a sense, it 
should not be a fixed match where we always decide 0 or always decide 1. There are two ways 
that this can be interpreted. One way is that our initial configurations themselves are 
bivalent, in the sense that depending upon who sends what messages, we can proceed towards 
ultimately deciding either 0 or 1. That is one way of looking at it. 


This is what Lemma 2 is trying to prove, that a bivalent initial configuration exists; but 
let us assume that it does not exist. Then the other way of looking at partial correctness is 
that the initial configurations themselves are either 0-valent or 1-valent. Fair enough. 
Since we are assuming to the contrary and we want to prove by contradiction, this will 
essentially mean that all the initial configurations are univalent. 


And furthermore, from partial correctness, there has to be at least one such configuration 
which decides 0 and at least one which decides 1; all of them cannot be deciding 0, and all of 
them cannot be deciding 1. Given this, let us look at what an initial configuration is. When 
we are starting the protocol, we have not sent any messages. There is no internal state, 
nothing. So, the only thing that we have in the initial configuration is the inputs. 


So, if we have n processes, we can say that the initial configuration comprises an n-entry 
input array where the p-th entry is x_p. If you recall, every process takes a one-bit input, 
which is the value that it will propose. So, for each of the n processes we have a one-bit 
input, and that is the only thing we have in our initial configuration. 


So, now let us look at a hypothetical line of all our initial configurations. Let us say that 
at one end we have a 0-valent initial configuration, and at the other end we have a 1-valent 
one. What does this basically mean? It means that there is a vector of initial input values 
associated with the first, and one more vector of initial input values associated with the 
second. The vectors are not the same. Because of the contents of the first vector we 
ultimately decide 0, and because of the contents of the second vector we ultimately decide 1. 


So, let us do one thing. Let us approach the second configuration from the first by gradually 
flipping one bit at a time. Let us say the Hamming distance is K, so we need to flip K bits 
to reach the second configuration from the first. As we keep on flipping bit after bit, we 
will ultimately reach two neighbouring configurations where we make the 0-valent to 1-valent 
transition. 


So, let us say that we reach a configuration C0 that is 0-valent and a configuration C1 that 
is 1-valent. The only difference between C0 and C1 in terms of their input vectors will be 
the input of one process P; the rest of the bits will be the same. 


So, at the location for process P, C0 has some value and C1 has the complement of that value, 
but the rest of the bits are exactly the same. So, we have two such configurations that are 
one bit apart, where only the input for process P is different. Let us use this as a starting 
point. 
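
The bit-flipping walk can be sketched in Python. The function find_adjacent_pair below is an 
illustration, not part of the lecture, and the majority rule used as the valence function is 
a made-up stand-in: in the actual proof the valence of an initial configuration is determined 
by the protocol, and under the assumption being refuted every initial configuration is either 
0-valent or 1-valent. 

```python
from typing import Callable, Tuple

Bits = Tuple[int, ...]


def find_adjacent_pair(c_zero: Bits, c_one: Bits,
                       valence: Callable[[Bits], int]) -> Tuple[Bits, Bits, int]:
    """Walk from c_zero to c_one one bit at a time.

    Returns (C0, C1, p): two input vectors that differ only at position p and
    have different valence. Such a pair must exist if the endpoints differ in
    valence and every vector on the path is univalent.
    """
    current = list(c_zero)
    for i in range(len(current)):
        if current[i] != c_one[i]:
            nxt = current.copy()
            nxt[i] = c_one[i]                      # flip exactly one bit
            if valence(tuple(current)) != valence(tuple(nxt)):
                return tuple(current), tuple(nxt), i
            current = nxt
    raise ValueError("the two endpoints have the same valence")


def majority(bits: Bits) -> int:
    """Hypothetical valence function used only to exercise the walk."""
    return 1 if 2 * sum(bits) > len(bits) else 0


C0, C1, P = find_adjacent_pair((0, 0, 0), (1, 1, 1), majority)
```

With this toy valence function, the walk goes (0,0,0), (1,0,0), (1,1,0), and the valence 
changes at the second flip, so the neighbouring pair is C0 = (1,0,0) and C1 = (1,1,0), which 
differ only in the input of process 1. 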


So, what do we have? Let me retrace our journey. In Lemma 2, we had to prove that a bivalent 
initial configuration exists. We assumed to the contrary that it does not exist. If it does 
not exist, then all our initial configurations are univalent, but by partial correctness, 
which holds, some have to decide 0 and some have to decide 1. 


Furthermore, what we said is that in the initial state, we just have 1 bit per process, so 
the initial configuration is just an n-bit vector. As we move from a configuration that is 
0-valent to a configuration that is 1-valent, we will always arrive at a C0, C1 pair, which 
are neighbours that differ by just one bit, for the P-th process, where C0 is 0-valent and C1 
is 1-valent. Given that we have come to this point, here is what we can do. 


So, let us start at C0 and consider an admissible run, and let its schedule of events be 
Sigma. This means that from C0, if I apply the sequence of events Sigma, I will reach some 
state, let us say Cx. Similarly, I can take C1 and apply the exact same sequence of events. 
But there is something special about this sequence of events: in it, process P does not take 
any steps, the way that we have defined them. In a sense, process P is quiet. 


As we had discussed, a faulty process betrays you when the time comes: it does not take any 
steps. So, let us say that process P betrays us and does not take any steps. Even though its 
inputs are different in the two initial configurations C0 and C1, since it does not take any 
steps, this difference is not visible to the other processes. 


So, now we apply a sequence of events in which P is not taking any steps. Regardless of 
whether we start from C0 or C1, we are bound to reach the same final configuration Cx, apart 
from P's own untouched input, primarily because only process P was different and in this case 
it did not take any steps. 
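
The claim that the other processes cannot see the difference can be checked on a toy example. 
Everything in this sketch is an assumption made for illustration: the helper names run and 
hide, the message log used as internal state, and in particular the few pre-loaded in-flight 
messages, which are there only so that the schedule has something to deliver. 

```python
from typing import List, Tuple

Event = Tuple[int, str]  # (receiving process ID, message)


def run(inputs: Tuple[int, ...], in_flight: List[Event], sigma: List[Event]):
    """Apply schedule sigma starting from the given inputs and in-flight messages."""
    states = [(x, ()) for x in inputs]     # per process: (input bit, message log)
    flight = list(in_flight)
    for p, m in sigma:
        flight.remove((p, m))              # raises ValueError if not applicable
        x, seen = states[p]
        states[p] = (x, seen + (m,))       # toy step: log the delivered message
    return tuple(states), tuple(sorted(flight))


def hide(config, p: int):
    """The configuration as seen without process p's own state."""
    states, flight = config
    return tuple(s for q, s in enumerate(states) if q != p), flight


P = 1
pending = [(0, "a"), (1, "c"), (2, "b")]
sigma = [(0, "a"), (2, "b")]               # process P takes no step in this schedule

from_c0 = run((0, 0, 1), pending, sigma)   # C0: P's input is 0
from_c1 = run((0, 1, 1), pending, sigma)   # C1: P's input is 1
```

The two resulting configurations differ only in the state of the silent process P; once P is 
hidden, they are identical, so whatever the other processes decide in one run, they decide in 
the other as well. 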


So, given that we are reaching the same configuration, there is a problem. C0 was 0-valent 
and C1 was 1-valent. If C0 is 0-valent, then Cx must decide 0, and then C1 cannot be 
1-valent: once a state is, let us say, 1-valent, then regardless of whatever transitions we 
make, all reachable configurations will be 1-valent, and that is not happening over here. 


So, this basically means that for C0 and C1, which decide different values, there can never 
be a common configuration that is reachable from both. But we do have a common configuration 
that is reachable from both, so there is a contradiction. If C0 decides 0, Cx has to decide 
0, but then C1 cannot lead to it, because C1 would want it to decide 1. So, there is clearly 
a contradiction, and the contradiction tells us that we will have at least one initial state 
which is bivalent. Fair enough: the lemma is proved, and we are happy. 


So, now we will essentially provide an inductive argument. We have a bivalent initial configuration. Bivalent basically means that it is open-minded, in the sense that it has not made up its mind whether it is going to decide 0 or 1. So, it is an open-minded configuration; it is undecided. We will now prove the following: given that an initial state is undecided, it is possible that after we apply a set of events, we will still arrive at an undecided state. Again, we apply a set of events, and we will still arrive at an undecided state, and so on and so forth.

We can continue till infinity and will always have a state which is undecided, which means that our protocol will never be able to make a decision, 0 or 1, which is exactly what we set out to prove. This means that the protocol is not totally correct: not every admissible run is a deciding run. Or, in common man's terms, if we want to solve a problem, it is possible that we might run an infinite number of steps and still not be able to solve it.
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So, what will help us in doing this is Lemma 3. So, what does Lemma 3 say? So, lemma 3 says 


let C be a bivalent configuration. Furthermore, let us consider an event e of the type (p, m),
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which means that there is an event destined for process p with message m. So, let this event be applicable to C.


Then let the set B be all the configurations that are reachable from C without applying e. Essentially, I apply all the other events, but I do not apply e. Furthermore, let the set D be all the configurations obtained from B, where all that I do is take a configuration in B and just apply e.


So, e was applicable to C, which is the initial configuration, but we did not apply it; we applied the rest of the events. That is okay. Given that e was originally applicable, and assuming that the message got delayed, which means the event e got delayed, it will still be applicable. So, let us say after that we reach a bunch of states B, and we can apply e to any one of them. So, we will reach a set of configurations D.


So, mind you, C was a single configuration, which is bivalent. But B and D are sets. B is the set of all configurations that are reachable without applying e, and then I apply e to every single configuration of B and reach the set D. What I want to prove, the statement of the lemma, is that D contains a bivalent configuration.
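To make the definitions of B and D concrete, here is a minimal Python sketch. It assumes a deliberately simplified toy model that is not part of the lecture: a configuration is just the set of pending events, and applying an event only removes it from that set. In the real model an event also changes the state of its process and may send new messages. The names `apply` and `reachable_without` are hypothetical.

```python
from collections import deque

def apply(config, event):
    """Toy transition: delivering an event removes it from the message buffer."""
    assert event in config
    return config - {event}

def reachable_without(c, e):
    """Set B: every configuration reachable from c without ever applying e."""
    seen = {c}
    queue = deque([c])
    while queue:
        cur = queue.popleft()
        for ev in cur:
            if ev == e:
                continue  # e is deliberately delayed
            nxt = apply(cur, ev)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

e = ("p", "m")                                   # the delayed event (p, m)
C = frozenset({e, ("q", "m1"), ("r", "m2")})     # initial configuration
B = reachable_without(C, e)                      # e is still pending everywhere in B
D = {apply(b, e) for b in B}                     # apply e to every member of B
```

Note that e stays applicable in every configuration of B, which is exactly why D can be formed by applying e to each of them.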


So, again, let us assume to the contrary that this is not the case. What is the proof-by-contradiction assumption? It is that D contains only univalent configurations. This is what we are setting out to disprove. We will start from the fact that C is bivalent. Since C is bivalent, let us look at two configurations, E0 and E1, where E0 is univalent and decides 0, and E1 is univalent and decides 1.


So, let us now look at the different cases that are possible. Of course, E0 and E1 will exist because C is bivalent: depending upon the transitions that are applied, C would ultimately take you towards two states, which are E0 and E1. Let us consider the case of E0. If E0 is an element of the set B, it means that e has not been applied to it. So, let us apply e to it. Then we will reach a state called D0, which is an element of set D, and this follows from the definition.


So, the same holds for E1: if E1 is an element of set B, in the sense that it is reachable without applying e, then all that I do is apply e to it, and I will reach another state D1,
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which is an element of D. So, now let us consider the other case. I am taking out some real estate over here, and I will write it over here.


So, let us assume that we reach the state E0 after applying e. This means that from C we reach some configuration, and then we apply e, so the resulting configuration is now a part of D. Let us call this configuration D0, and thereafter, after a set of transitions, we reach E0.

So, basically this state over here has to be an element of the set D. This follows by definition: the state just before it is reachable from C without applying e, so it is an element of set B, and then we apply the event e, so by definition the result becomes an element of set D. Given that this state is not bivalent, which follows from our assumption, and that it ultimately leads to E0, this state has to be D0, which is univalent and decides 0. Using a similar argument, you can also say that if we apply e before reaching E1, then this state over here is D1, which is an element of D.


So, if I look at both these cases, what will they essentially tell us? They tell us that either there is a path from D0 to E0 or there is a path from E0 to D0; it does not matter which. One of these is true. Similarly, there is a path from D1 to E1 or from E1 to D1, which is fine. But then what is the key conclusion? Why did we do all of this? Again, this is the last piece of real estate that I will use on this slide.


The reason that we did all of this is to prove something very simple. What we proved is that the set D contains both 0-valent and 1-valent configurations. And how are we able to prove that?


So, let me go back again. We said that state C is bivalent. This means that two states E0 and E1 would be reachable from C, where E0 decides 0 and E1 decides 1. We took two cases. First, let us assume that to reach E0 we did not apply e; then there is a state D0 in D and a state D1 in D, where D0 decides 0 and D1 decides 1. Then we looked at the other case, where we apply e to come to E0. Then also there are two states in D, where one state is 0-valent and one state is 1-valent.


So, this means that D contains both 0-valent and 1-valent configurations. Even though we are assuming that D does not contain a bivalent configuration, it is not the case that D contains only 0-valent configurations or only 1-valent configurations. It contains both. This is what we are able to prove.
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Now, what we are going to do is we will play the same trick that we played in lemma 2, where 
we are saying that we have the set D. From C, we have come to the set D by applying event e, and this set contains two kinds of configurations, one kind 0-valent and another kind 1-valent.


So, this is where we stand. Now let us consider two configurations, C0 and C1, both elements of the set B. Recall what the set B is: it is the set of all the configurations that are reachable from C without applying event e. Furthermore, it is possible to prove, the same way as we did in lemma 2, that C0 and C1 are essentially one step away from each other. Furthermore, the state D0 comes from applying e to C0, and the state D1 comes from applying e to C1.


So, let me rephrase this. We are saying a lot of things, and this is complicated, so I will say this again. What was the set B? The set B was the set of all the configurations that are reachable from configuration C without applying event e.

No problem. Then when I apply event e, I come to a new set. Let us say this is set D; every single point within set D is generated by taking a point from within set B and applying event e. What we were just able to prove is that set D contains both 0-valent and 1-valent configurations. And by our assumption, it does not contain a bivalent configuration.


So, now what I am saying is that there will definitely be two configurations, C0 and C1, that lead to D0 and D1 after applying event e, where D0 decides
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0 and D1 decides 1. That has to be the case because, given that you have D0 and D1, they will have preimages in set B, and D0 and D1 would have been created by applying event e to them. So, this follows by definition.


So, there is nothing non-obvious over here. But if you look at all the states in set B, you will definitely arrive at one such pair where the relationship C1 = e’(C0) holds. This is similar to lemma 2, where we said that if we consider all the configurations, we will definitely arrive at a point where two configurations are essentially one step away, and one decides 0 and one decides 1.


In this case, of course, it is not the case that C0 itself decides 0; rather, C0 after applying event e goes to D0, which decides 0, and C1 after applying e goes to D1, which decides 1. But the point is that by a reasoning similar to lemma 2 we can argue that we have a large number of configurations, and all of them are reachable from C via a large number of paths.

So, let us say this is C1, and I go back, back, back all the way up to C. There will definitely be one pair where there is a boundary, in the sense that the two configurations are one step away and, of their images in D, one decides 0 and the other decides 1, because all the states in D are univalent. There will be at least one such pair, one step apart, where C0 leads to D0 and C1 leads to D1.


So, let us assume that this is what they are called. The single step is the event e’ = (p’, m’). Given that we have been able to prove this, we will now look at two cases and see what happens. Recall that we are looking at two events over here: event e and event e’ (e dash). Event e applies to process p, and e’ applies to process p’ and leads to the transition from C0 to C1.


So, case 1, which is kind of simple, is that p’ ≠ p. What does this mean? We should draw another simple diagram, which would look like this: we have state C0, and from there we apply e’, so we come to configuration C1. Then, what we know is that from C0 we apply event e and come to D0. Similarly, from C1 we apply event e and come to D1.


So now here is the fun part: e and e’ are disjoint. e is working on process p and e’ is working on process p’. So, by lemma 1, commutativity would hold because the process sets are disjoint, which basically means that if I apply e’ to D0, then I should be able to come
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to D1. And herein lies a contradiction, because any transition from a 0-valent state will never take us to a 1-valent state, but this appears to be happening. This is a contradiction, and this is not possible.
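The commutativity used in case 1 can be seen in a small Python sketch. It assumes a toy model that is not from the lecture: a configuration is a pair of per-process message logs and a buffer of pending events, and an event (p, m) only touches the log of p. The `apply` function and the configuration layout are hypothetical illustrations.

```python
def apply(config, event):
    """Toy transition: process p consumes message m; no other process changes."""
    states, buffer = config
    p, m = event
    new_states = dict(states)
    new_states[p] = states[p] + [m]      # p records the message it received
    return new_states, buffer - {event}

e = ("p", "m")            # event on process p
e_prime = ("q", "m2")     # event on a different process p' = q

C0 = ({"p": [], "q": []}, frozenset({e, e_prime}))
C1 = apply(C0, e_prime)   # C1 = e'(C0)
D0 = apply(C0, e)         # D0 = e(C0)
D1 = apply(C1, e)         # D1 = e(C1)

# Because e and e' act on different processes, the order does not matter:
commutes = apply(D0, e_prime) == D1
```

In this toy model `commutes` is true, so e’ takes D0 to D1. In the proof that is precisely the problem: D0 is 0-valent and D1 is 1-valent.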


So, we clearly have a contradiction over here, and this is definitely not possible. Since this is not possible, what we shall do is consider the other case, which is case 2, where p = p’.
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So, the other case where p = p’, let me write it over here. So, again, let us draw a few diagrams. 
So, we have C0. We apply e’ to it and come to C1. Then we apply e to it and come to D1. Similarly, what I know is that I apply e to C0 and come to D0. So, this is the information that I have. Furthermore, I know that both e and e’ are actually working on the same process.


So, this is where, again, the notion of our faulty process will come in. Let us consider a run, a run sigma, where p does not take any steps. As I said, the faulty process betrays us at the right time. So, let us assume that p does not take any steps.

Let us consider one such run from C0. In this run, p does not take any steps. Let this be a deciding run, because the point is that we do not know for how long p is not going to take steps. Since we have assumed that our system is totally correct, for every state we should have an admissible and deciding run. So, let us assume that we have one such deciding run in which p is not taking any steps.
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So, P is the faulty process in this case, and it has betrayed us. So, in this case, we will apply the 
run sigma, where p is not taking any steps, and let us say that we arrive at state A. Now things appear to be possibly under our control.


So, let us now apply event e to state A. If we do that, we will reach some state. But here is the fun part: e and sigma are mutually disjoint. In e, only process p takes steps, and in sigma, process p does not take any steps. So, they are mutually disjoint, and nothing stops us from applying sigma to D0 as well. If we do that, given that D0 is univalent, the state here will also be univalent.


And so, basically this state will be E0. No problem. Let us do the same thing on the other side. Let us first apply e’ and then apply e. If I apply e’ and e in quick succession, then I will arrive at one state; let us call it E1.


The reason is this: e’ and e are again on the same process p, but process p is not there in sigma. So, nothing stops us from applying sigma to D1. With that, we arrive at state E1, and E1 has to be 1-valent because D1 is 1-valent. By lemma 1, this construction is allowed. Just take a look at this once again: sigma, as far as we are concerned, is an admissible deciding run. It exists because we have assumed that our protocol is totally correct, and from every configuration an admissible deciding run exists.


So, if that is the case, then we can do this construction. What we see is that from state A we can reach a 0-valent state and we can reach a 1-valent state. This means that state A is bivalent, but this is not possible since the run to A is deciding. Since sigma is a deciding run, A cannot be a bivalent state. Consequently, we have a contradiction over here because this goes against our assumptions.


So, given that this goes against our assumptions, lemma 3, as we have written it, stands proven: D contains a bivalent configuration. We looked at two cases, one case is over here and one case is over here, and in both the cases we arrived at a contradiction. So, this essentially means that, as per lemma 3, D contains a bivalent configuration. Now we are almost done. There is nothing much that we need to do other than spend some amount of time on the next slide.
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So, here, what did we do? We said that at the beginning you will have a bivalent configuration. Think of this as the base case of the induction. Then we said: consider an event e, which is applied to process p. You drain the rest of the events that you have, and you will reach a large number of states. From there you apply e, and it is possible to reach a bivalent configuration again.


And you just keep on doing that. You keep on applying events, and you will keep on reaching bivalent configurations, the way that we have shown. So, it is possible that you keep on applying events infinitely, and you will still be reaching bivalent configurations, in the sense that you will still not be able to make a decision.


So, this means that your consensus will not happen because your protocol is indecisive. This basically means that you will simply not arrive at a configuration where a decision has been made. So, this pretty much proves the theorem. What is the implication of the theorem? It matters whenever I want to design any kind of a distributed algorithm.
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So, it turns out that most distributed algorithms can be mapped to some kind of a consensus problem, some kind of an agreement problem. The mapping might not be efficient, but the algorithm can be mapped to a consensus problem. And this consensus problem cannot be solved even with one faulty process. This means that if you consider faults in our system, our system will be severely challenged. It is simply not possible to guarantee any form of agreement even if one single process is faulty, and this to us represents a significant issue. We will keep referring to this as the FLP result. We will keep this in mind, and subsequently we will discuss a host of consensus protocols.


So, we will discuss Paxos in later slides of the lecture set. We will discuss Raft. In all of them, we will say that, yes, we know this result. It basically means that we will not be able to achieve consensus all the time, but at least let us aim to do this most of the time. Most of the time, let us try to achieve some degree of consensus, and otherwise let us rely on some form of timeout mechanism and some form of timer, because in the real world nothing is really indefinite. I mean, a delay can be 1 second or 1 hour or 1 day, but with some additional timing mechanism we can do better.

So, nothing in practice is asynchronous in the strict theoretical sense. But of course the FLP result kind of gives us a hard bound on what is doable, or rather what is not doable.
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In this lecture, we will discuss the paxos algorithm. Paxos is a distributed consensus algorithm that 
was proposed more than 20 years ago, and since then many avatars of Paxos have been proposed. So, we will be looking at something called "Paxos Made Simple", which is a simplified explanation of Paxos that was published later, roughly in 2001.
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So, we will first discuss the problem of consensus which is a very very important problem in 


distributed systems and then we will discuss the paxos algorithm and finally prove it. 
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So, the problem of consensus is very simple. So, it is like this that we have a multitude of processes 
and so we have different processes right over here. Each process may propose a value. So, the 
processes coordinate among themselves to agree upon one of the proposed values. So, let us say 
that this process over here proposes 3, this proposes 4, 5, 6, and 8. So, then these five processes 


coordinate among themselves to agree upon one of the values that has been proposed. 


So, kindly note that the value that all the processes agree upon has to be one among the proposed values. It is not the case that all of them can agree upon value 0: it is true that all of them would agree on the same value, but the value 0 has not been proposed. So, this criterion has to hold: the agreed value v has to be proposed by at least one process.
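The two requirements stated above, that everybody agrees on one value and that this value was actually proposed, can be written down as a small check. This is only an illustrative Python sketch; the function name `is_valid_consensus` and the process names are assumptions made for the example.

```python
def is_valid_consensus(proposed, decided):
    """Check a consensus outcome.

    proposed: dict mapping process id -> value it proposed
    decided:  dict mapping process id -> value it finally decided
    """
    values = set(decided.values())
    agreement = len(values) == 1                  # everyone decided the same value
    validity = values <= set(proposed.values())   # the value was actually proposed
    return agreement and validity

# The five processes from the example propose 3, 4, 5, 6 and 8.
proposed = {"P1": 3, "P2": 4, "P3": 5, "P4": 6, "P5": 8}

all_five = {pid: 5 for pid in proposed}   # acceptable: 5 was proposed
all_zero = {pid: 0 for pid in proposed}   # agreement holds, but 0 was never proposed
```

Here `is_valid_consensus(proposed, all_five)` holds, while `is_valid_consensus(proposed, all_zero)` does not, which is the point made about value 0.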


So, why is the consensus problem so important? Well, the reason is that in a distributed system the main aim is that we want the entire system to look like a single system to an outsider. This means that between the different nodes there has to be a notion of agreement: they agree upon something, such that the agreed value can be shown to an outsider.


So, let me give one example. One example in this case would be that of somebody trying to buy an airline ticket with a credit card machine. We have different processes. One process of course is the user, one process is the credit card machine, and one more process is the credit card company; let me call it the cc company. Then, if I am using an online booking site like MakeMyTrip or Expedia, that would be one more entity; let me call this the site. Then of course
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we will have the airline. Once the ticket is booked, I get an email with my e-ticket. So, then of course my email is one more process.


So, how many processes do we have? We have 1, 2, 3, 4, 5, 6. So, if I book a ticket, all six of us have to be in agreement about one of two things. Either the ticket is booked: then the machine validates my credit card, the money is debited from my credit card, it goes to the site, the airline books the ticket, and I get the ticket via email. Or, let us say, the ticket is not booked for some reason: then also there is no problem, and I am duly informed that my ticket could not be booked.


The problem arises when, let us say, I book a ticket for 10,000 Indian rupees, and then 10,000 rupees get deducted from my account but the ticket is not booked. This has happened to all of us; it has definitely happened to me a lot that my money was debited but my ticket was never booked, and then I had to keep on calling the online travel site.

And they would say that all is well, whereas all is not well because I do not have my ticket. The main problem is that there was no consensus among these 6 processes. Had there been a consensus among these 6 processes, this particular problem would not have happened.


So, either all of them would have said that my ticket is booked, which means I would have had my ticket with me, or all 6 would have said that for a certain reason they were not able to book my ticket, which is also fine by me. But I do not want an intermediate situation. The example that we gave is essentially a commit/abort problem, where commit means that my ticket is booked and abort means that my ticket is not booked. So, all of us have to agree on (())(6:02).

And well, there can be many other examples of consensus. One is where I am doing a distributed computation, and I need to ensure that all the processes get the same values, and there are others as well.
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Consensus with Reliable Processes 


@ Elect a leader. 
@ Use the leader to collect proposals, decide on a consensus
value, and broadcast it to all processes.
@ Issues: 
@ Fault tolerance 
@ Centralized processing 


The consensus problem is interesting in the presence of faults 
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So, we can say that what is the big deal with consensus, it is very simple. We can elect a leader we 
already have many algorithms to elect a leader. We use the leader to collect proposals, decide on a consensus value, and broadcast it to all processes. This seems to be a very fair idea: we have a set of processes, they elect one leader among them, all the processes send their proposals to the leader, and the leader chooses one and distributes it.
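The leader-based scheme described here can be sketched in a few lines of Python. This assumes that all processes are reliable and that messages are never lost, which is exactly the assumption the scheme depends on. Picking the lowest process ID as leader and the smallest value as the decision are arbitrary stand-ins chosen for the illustration, not something the lecture prescribes.

```python
def leader_based_consensus(proposals):
    """proposals: dict mapping process id -> proposed value.

    Returns the leader and the value each process ends up with.
    """
    leader = min(proposals)              # stand-in for a leader election algorithm
    collected = dict(proposals)          # every process sends its proposal to the leader
    chosen = min(collected.values())     # the leader picks one proposed value (arbitrary rule)
    decisions = {pid: chosen for pid in proposals}   # the leader broadcasts the choice
    return leader, decisions

leader, decisions = leader_based_consensus({1: 3, 2: 4, 3: 5, 4: 6, 5: 8})
```

Every step in this sketch goes through the single leader, which is the centralization and fault-tolerance concern listed on the slide.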


Well, this sounds like a good idea. The only problem is that if the leader fails, then there is an issue, but then of course we can say that we can elect one more leader. Sadly, there is also an issue with that, and we will see that on the next slide. Also, the philosophical issue with this algorithm
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is that in a certain sense we are centralizing the distributed algorithm, which also violates the spirit of distributed computing. In distributed computing, the main idea is to increase throughput by allowing multiple processes running on multiple servers to do the job.


So, why is this consensus problem that interesting? Well, because we are going to consider faults, and faults always happen in the real world. The reason that I was not able to book my ticket in many instances was basically that there was a failure at some point: either there was a failure in the network or there was a failure in the credit card authentication process. So, these failures are what give a real-world feel to the problem.
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It is impossible to achieve consensus with even one faulty pro- 
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So, now let me come back to what I said on the previous slide, slide number 4 that we can elect a 
leader, and the leader can choose one of the proposals. There is a very famous result in distributed systems called the FLP result, which we have already discussed in the class. It says that even if we have one faulty process in an asynchronous system, it is not possible to achieve consensus.

So, what again is an asynchronous system? It is a system where the processes do not have a shared time base and they are not obliged to give a reply within a certain period of time. So, that is an asynchronous system. Here, if we have a faulty process, we have no way of
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distinguishing between a failed process and a slow process, so consensus is not possible. Of course, we had a long and elaborate proof.


So, if I were to just take the gist out of that proof, the gist would be this: let me assume that I have 2n + 1 processes. Let n processes propose 1 and n processes propose 0. So, the (2n + 1)th process essentially holds the key, and this is where the algorithm can get stuck forever, because if that process is slow or has failed we will never know what it decided. This is essentially why the problem of consensus becomes that interesting when we consider faulty processes.
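The intuition of the 2n + 1 example can be illustrated with a short Python sketch. It assumes, purely for illustration, that the outcome is decided by a simple majority of votes; this is not the FLP proof and not how any particular protocol works. A vote of `None` stands for a process from which nothing has been heard yet.

```python
def majority_outcome(votes):
    """Return 0 or 1 if the outcome is already determined, else None.

    votes: list of 0, 1, or None (None = no message received yet).
    """
    zeros = sum(1 for v in votes if v == 0)
    ones = sum(1 for v in votes if v == 1)
    pending = sum(1 for v in votes if v is None)
    if zeros > ones + pending:
        return 0
    if ones > zeros + pending:
        return 1
    return None   # the silent processes could still swing the result

n = 3
stuck = majority_outcome([0] * n + [1] * n + [None])   # the (2n + 1)th vote is missing
resolved = majority_outcome([0] * n + [1] * n + [0])   # the last process finally answers 0
```

With the last vote missing the outcome stays undetermined, and in an asynchronous system the others cannot tell whether that process is slow or has failed.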


So, now the choices in front of us are reasonably stark. We know that because of the FLP result we cannot devise a consensus algorithm that is guaranteed to terminate when even one process is faulty, but faults are a part of the real world. So, should we then abandon our efforts? Well, the answer is no.
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@ Safety Property : Something wrong does not happen. 


@ If the traffic light on one road is green, then the traffic light
on the perpendicular road is red.


@ Liveness Property : Something good always happens.
@ The red light will ultimately turn green.
@ Let us propose protocols that never violate the safety constraint.


@ Liveness is a secondary criterion.
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Instead, we look at what exactly we want. So, we need to prioritize. So, when we know that we 
cannot get all of it, we should at least be happy if we get some of it. So, what is that? We can look at the safety property and the liveness property. The safety property says that something wrong does not happen.


So, consider a traffic light. If the light on one of the roads is green, then we do not want the light on the other road to also be green; otherwise there will be a collision. So, we want the lights on the perpendicular road to be red. This is a safety property. The other
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is a liveness property, which says that something good always happens, which means the red light over here will ultimately turn green.
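The traffic light example can be turned into a tiny Python sketch. The controller below is a hypothetical two-phase controller invented for the illustration. Safety is checked as an invariant on every state of a trace. Liveness is only witnessed on a finite trace here, because in general liveness is a statement about infinite runs and cannot be established by inspecting a finite prefix.

```python
def lights(step):
    """Hypothetical controller: the two roads alternate on every step.

    Returns (light on road A, light on road B).
    """
    return ("green", "red") if step % 2 == 0 else ("red", "green")

def is_safe(state):
    """Safety: the two perpendicular roads are never green together."""
    return state != ("green", "green")

trace = [lights(t) for t in range(6)]

safety_holds = all(is_safe(s) for s in trace)           # nothing wrong happens
road_b_turns_green = any(b == "green" for _, b in trace)  # the red light turns green
```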


So, whenever we are looking at a protocol, we would ideally like to have both, but now we know that it is not possible to have both; the FLP result clearly says so. Since we cannot have both, let us have protocols that never violate the safety property. In the case of consensus, this means that the key requirements of consensus, namely that we choose one among the proposed values and everybody agrees on the same value, should not be violated.

Liveness in this case would be that the algorithm always terminates and is always successful for all processes. So, let us assume that we are not concerned about liveness, which means that there will be cases in which the algorithm will not terminate. Let us just live with safety alone and propose the Paxos algorithm, which satisfies the safety property.
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So, coming to the Paxos algorithm, we have three kinds of nodes. We will use process and node synonymously. We have a proposer; a proposer is one that proposes a value. Then we have acceptors, which are a set of nodes that accept proposed values. Here, accept is a temporary acceptance, not a permanent acceptance, which means that the acceptor records the value that is being proposed. So, we essentially have three levels for a proposal: we first propose a proposal, then we send it to a set of acceptors where they can temporarily accept it, and finally, when the consensus is done, we call this choosing.
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So, choosing means that a value has been chosen and consensus has been achieved. So, accept is 
a step in the middle where a value is just temporarily accepted or buffered. Then we have a learner. Learners are nodes that join the consensus protocol late and want to learn the value that has been chosen, the value that has been accepted by everybody. Note that a node can be a proposer and an acceptor at the same time, so it is not that we have designated nodes that are proposers and designated nodes that are acceptors.
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Motivation 
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First-Accept Condition: C1 


An acceptor does not know how many proposals are there in 
the system. Hence, he must accept the first proposal that he 
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So, the main motivation of the algorithm actually comes from several of these conditions that we 


will name C1, C2, C2a, C2b, etc. So, let us outline these conditions. The first condition is called
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the first accept condition, condition C1. See if I am an acceptor I do not know how many proposals 
are there in the system. So, any node in a distributed system always has a very local view. So, it 


does not know how many proposals are actually floating around in the system. 


Hence, the only choice that an acceptor actually has is to accept the first proposal that it gets. It has no other choice. So, this is pretty much the best that the acceptor can do: accept the first proposal that it gets. Subsequently it can become more choosy, but at least the first proposal it needs to accept. Now, several values can be proposed by different proposers, and different acceptors could accept different proposals.


So, then the problem is that we will have a situation in which our acceptor nodes are sitting there having accepted different proposals p, p’. But finally, something has to be chosen. So, we will find that it will become necessary for an acceptor to actually accept multiple proposals and then ultimately do something such that one of them is ultimately chosen as the consensus value.


And so, we will see that each proposer, after several rounds of messages, will ultimately need to choose a consensus proposal, and we will see how that will happen. But the important point in C1 that I would like to make again is that an acceptor has an extremely local view. So, the first proposal that it gets it needs to accept, and it also needs to accept many more proposals after that, primarily because there will be other acceptors that have accepted other proposals.


So, the model that I am talking about is that you have different proposals that keep proposing and 
you have a set of acceptors that would be like buffering a subset of the proposals that it gets. And 


ultimately, from this subset one will be chosen, we will see how. 




(Refer Slide Time: 17:46) 


Motivation

Proposal Numbering

<pid, ctr>

@ Proposal → (number, value)
@ All the proposal numbers are unique

Condition - C2 (Consensus Condition)

If a proposal (n, v) is chosen, then every proposal with a number
greater than n that is chosen, has value v.

@ By induction, we can prove that only one value is chosen.

Let us now see, how to satisfy condition C2 to achieve
consensus.

Smruti R. Sarangi Assorted Algorithms


Say every proposal is a two-tuple: a number and a value. We assume that all the proposal numbers are unique. It is not required, but it would be good to assume that they are also monotonically increasing, at least in the sense that every node issues proposals with monotonically increasing numbers. Mind you, this is not required, but it will make our job easy.

How do we ensure that the proposal numbers are unique? Well, this is also easy: we can assume that the number is a tuple of a pid and a local counter. We have already seen this in the case of logical Lamport clocks, where a combination of the process ID and a local counter generates unique numbers. Furthermore, each proposal proposes a value, and one among these proposed values needs to be chosen. We will assume that both the number and the value are integers.
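The numbering scheme can be sketched in a few lines of Python. This is only an illustration: the names `ProposalNumber` and `NumberGenerator` are hypothetical, and comparing the counter before the pid is an assumption (the lecture only requires uniqueness).

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True, order=True)
class ProposalNumber:
    # Field order matters: dataclass ordering compares ctr first, then pid.
    ctr: int  # local counter of the issuing process
    pid: int  # process ID, breaks ties between processes


class NumberGenerator:
    """Issues unique, locally increasing proposal numbers for one process."""

    def __init__(self, pid):
        self.pid = pid
        self._ctr = count(1)  # local counter starts at 1

    def next(self):
        return ProposalNumber(next(self._ctr), self.pid)


g1, g2 = NumberGenerator(pid=1), NumberGenerator(pid=2)
a, b, c = g1.next(), g2.next(), g1.next()
```

Two processes can never produce the same number because the pid differs, and one process never repeats a number because its counter only grows.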


So, the consensus condition is that if a proposal (n, v) is chosen, then every proposal with a number greater than n that is chosen has to have value v. We will see that the system can choose multiple proposals, but the moment a proposal with a certain number has been chosen, every subsequent proposal that is chosen has to carry the same value v, which means that this value v becomes the consensus value.

So, condition C2 is sufficient to achieve consensus. This is reasonably clear: the moment we have chosen one value, any higher numbered proposal issued by any process has to end up choosing the same value.
So, we can now further specialize condition C2 to condition C2a. We will propose several specializations. The first specialization is that if a proposal with value v is chosen, then every higher numbered proposal that is accepted by any acceptor has value v. What this essentially says is that after a proposal with value v is chosen, nodes can of course still keep proposing and still keep accepting, but the acceptors are now bound by value v. For any higher numbered proposal, they are not going to accept any value other than v.

So, here the explanation is this. Assume that a process wakes up and gets a proposal. By condition C1 it needs to accept it, which is clearly not desirable if the proposal carries a different value. Hence, to maintain both C1 and C2a, we have to strengthen C2a so that the acceptor is simply never asked to accept a proposal that does not have value v.
So, the way that we want to ensure condition C2a is by proposing a condition C2b, which will imply condition C2a. This is what C2b says: if a proposal with value v is chosen, then every higher numbered proposal that is issued by any proposer has to have value v.

So, we are further tying up the system. We are saying that once something has been chosen, any higher numbered proposal that any proposer subsequently issues also has to have value v. This is a reasonably strong condition, but let us look at its implications. The implication of C2b is that it automatically implies C2a.
Let us see why. Since a proposal with value v is chosen, every higher numbered proposal accepted by an acceptor also has to have value v, because only v is being proposed; nothing else is being proposed. If you look back at C2b, what I am trying to say is that the moment a proposal is chosen, any other proposal issued by any proposer, subject to the fact that it has a higher number, has to propose the chosen value.

We will see how to ensure that, but assuming C2b is ensured, it is very easy to see that C2b implies C2a: any acceptor can accept only value v, because only value v is being proposed. C2a in turn implies C2: since only value v is being accepted, any higher numbered proposal that is chosen will also have value v.

So, what did we do? We wanted to ensure C2. To ensure C2 we tried to ensure C2a, and to ensure C2a we created C2b. So, if we can ensure C2b, it will ensure C2a, which will ensure C2, and we have said that if we can ensure C2 then we can ensure consensus. That is our claim.
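The chain of conditions can be written compactly. The predicate names below are notation assumed only for this summary: chosen(n, v), accepted(n, v), and issued(n, v) stand for "proposal (n, v) has been chosen", "has been accepted by some acceptor", and "has been issued by some proposer".

```latex
\begin{align*}
\text{C2:}\quad  & \mathrm{chosen}(n, v) \wedge \mathrm{chosen}(n', v') \wedge n' > n \implies v' = v \\
\text{C2a:}\quad & \mathrm{chosen}(n, v) \wedge \mathrm{accepted}(n', v') \wedge n' > n \implies v' = v \\
\text{C2b:}\quad & \mathrm{chosen}(n, v) \wedge \mathrm{issued}(n', v') \wedge n' > n \implies v' = v \\
& \text{C2b} \implies \text{C2a} \implies \text{C2}
\end{align*}
```

The implications hold because a proposal must be issued before it can be accepted, and must be accepted before it can be chosen.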


So, we had a condition C1, and then we had C2, which had these sub-conditions. This is reasonably complicated, so I would advise the listener of this video to go through this several times and to also read this in the paper, because appreciating this part is reasonably tricky.

And this is probably why many people have had a lot of problems with the Paxos algorithm, but once we go through the algorithm and the proof, the need for these assumptions and conditions will become clear. So, let us go back and recall. Condition C1 was that an acceptor does not know how many proposals there are in the system. Hence, the acceptor must accept the first proposal that comes its way.

So, you accept the first proposal you get, and furthermore you have to accept a few more to actually complete the algorithm. Condition C2 was that if a certain proposal (n, v) is chosen, then any proposal whose number is more than n and that is chosen has to have value v (C2), any such proposal that is accepted has to have value v (C2a), and any such proposal that is issued has to propose value v (C2b). If we can ensure these things we can ensure consensus, and how to do that is what we will see in the algorithm.
So, the Paxos algorithm is divided into two phases, known as phase 1 and phase 2. The algorithm might seem simple, but it is in the category of deceptively simple algorithms, where even a very simple algorithm is reasonably hard to prove. First, the proposer starts by sending a prepare message to a majority of acceptors. It is assumed that the size of the network is known.

So, to a majority of acceptors we send a prepare message. The other very important point to note over here is that in the prepare message we only send the number of the proposal, which is assumed to be unique, so no two proposals will have the same number. We are not sending the value. This is important, and we will see that it will be useful in proving condition C2b, which essentially means that the proposer is not at liberty to propose a value.

So, the proposer is basically dependent on other factors to propose a value; the liberty of proposing any value that it wants is not there with the proposer in phase 1. In phase 1, if we look at line number 2, the prepare message is sent with just the number of the proposal, that is all. Every acceptor maintains two state variables: one is maxPrep and the other is maxAccept.


So, maxPrep is the largest number that the acceptor has received as a part of a prepare message. Every prepare message carries the number of the proposal, and the largest such proposal number that has been received by an acceptor is called maxPrep. The prepare message is accepted by an acceptor only if n > maxPrep. Mind you, n cannot be equal to maxPrep, because proposal numbers are assumed to be unique.

So, only if n > maxPrep is the prepare message accepted. Otherwise, we can add an else over here: either we just ignore the message and nothing goes back to the proposer, or we send a NACK, a negative acknowledgement, saying that this proposal has failed.

So, assuming it is accepted, we set maxPrep ← n, so maxPrep always increases monotonically, and we return the value of maxAccept. maxAccept is essentially the value that the acceptor has currently accepted. Why the max comes in maxAccept will be clear in later slides; at the moment all that we can say is that maxAccept is the value that the acceptor has accepted.

In fact, if it is given a choice of values it has taken the maximum; that is where max comes from. But let us assume for the time being that if the prepare message is accepted, then the currently accepted value is sent back.
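The acceptor's side of phase 1 can be sketched as follows. This is a minimal illustration, not the slide's pseudocode: the class and method names are hypothetical, `None` is assumed as the null initial value, and the failure case returns an explicit `(False, None)` reply in place of silence or a NACK.

```python
class Acceptor:
    """Phase 1 state of a single acceptor (sketch)."""

    def __init__(self):
        self.max_prep = None    # largest prepare number seen so far
        self.max_accept = None  # value currently accepted, if any

    def on_prepare(self, n):
        """Handle prepare(n); return (ok, max_accept)."""
        if self.max_prep is None or n > self.max_prep:
            self.max_prep = n               # maxPrep only ever increases
            return True, self.max_accept    # report the accepted value
        return False, None                  # stale number: reject


acc = Acceptor()
first = acc.on_prepare(5)  # first prepare always passes
stale = acc.on_prepare(3)  # lower number than maxPrep: rejected
newer = acc.on_prepare(8)  # higher number: passes and raises maxPrep
```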
So, once the response is sent back, we look at phase 2. At the end of phase 1, we have the proposer here and we have multiple acceptors. They first got a prepare message, and if it passes phase 1, they are supposed to send a message back: the i-th acceptor will send back its value maxAccept_i. On similar lines, the rest of the acceptors in the majority will send back their maxAccept values.

So, what the proposer will do is that after it gets back all of these maxAccept messages, it will compute the maximum of these values and assign it to a variable v. It is important to consider what happens at the time of initialization. I have not mentioned this, but as a side note, the variables maxAccept and maxPrep are both initialized to the null value.

If all the values and proposal numbers are positive, this can be -1, or it can be any other value that indicates null. So, when we run phase 1 just after initialization, all the maxAccept values that are sent to the proposer will be null. We can then add an extra line just after the max computation: if all of them are null, then the max of all the nulls will still be null.

So, if v = ∅ (null), then we set v = the proposed value. What this essentially says is that if all the acceptors send a null value, then the proposer is at liberty. This is the only time when the proposer is at liberty to actually propose a value on its own, which would be the value that it wanted to propose. It is important that it otherwise does not have this liberty: it gets the freedom only when all the acceptors return a null value, and then the proposer assigns its own value to v. Otherwise, it needs to take the maximum of the values that it has gotten.

So, after that we send an accept message. This time the message has the number and the value, quite unlike the prepare message that did not have the value. We send an accept(n, v) message to all the acceptors in the quorum, and that begins phase 2. So, the phase 1 messages were the prepare message and the maxAccept response, and we begin phase 2 with the accept message.
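The proposer's value selection at the end of phase 1 is small enough to write out. The function name `choose_value` and the use of `None` for the null value are assumptions made for this sketch.

```python
def choose_value(max_accepts, own_value):
    """Pick v for accept(n, v) from the maxAccept replies of a majority."""
    seen = [v for v in max_accepts if v is not None]
    if not seen:          # every acceptor answered null
        return own_value  # only now may the proposer use its own value
    return max(seen)      # otherwise it is bound to the maximum reported value


fresh = choose_value([None, None, None], own_value=7)
bound = choose_value([None, 4, 9], own_value=7)
```

In the second call the proposer wanted 7 but has to propose 9, which is the sense in which it is "not at liberty" to pick its own value.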
So, we begin phase 2 over here with the accept message that is sent from the proposing node to the accepting node. Phase 2 works in a similar manner. In this case the check is n ≥ maxPrep. Note the equal to. The equal to comes because of the following reason. First we send a prepare message, which is phase 1, and then, assuming it passes phase 1, we send an accept message.

If there are no higher numbered prepare messages in the middle, then phase 1 has set maxPrep = n, and maxPrep maintains its value till the accept point. So, in phase 2 we might very well find that maxPrep = n. It might be larger also. If it is larger, it means that another prepare message came with a higher numbered proposal. In this case, of course, phase 2 fails; otherwise phase 2 passes.

So, if n ≥ maxPrep, which basically means that no higher numbered proposal has come in the middle, then phase 2 passes and we set the value of maxAccept to the value that the proposer is sending, which is v. Essentially, the maxAccept field captures the value that was sent by the proposer the last time phase 2 passed. Subsequently, the proposal (n, v) is accepted, v becomes the value of the acceptor, which is what it uses later on, and it sends a response to the proposer that the proposal has been successfully accepted.
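A sketch of the acceptor with both handlers shows why the phase 2 check uses "greater than or equal". The names are hypothetical and `None` again stands in for the null value; the boolean returned by `on_accept` plays the role of the response to the proposer.

```python
class Acceptor:
    """Acceptor with phase 1 and phase 2 handlers (sketch)."""

    def __init__(self):
        self.max_prep = None
        self.max_accept = None

    def on_prepare(self, n):
        if self.max_prep is None or n > self.max_prep:
            self.max_prep = n
            return True, self.max_accept
        return False, None

    def on_accept(self, n, v):
        """Handle accept(n, v); True means the proposal was accepted."""
        # Equality is the normal case: our own prepare set max_prep to n.
        if self.max_prep is None or n >= self.max_prep:
            self.max_accept = v
            return True
        return False  # a higher numbered prepare arrived in the middle


acc = Acceptor()
acc.on_prepare(5)
accepted = acc.on_accept(5, 42)  # no intruder: passes
acc.on_prepare(9)                # a higher numbered proposal prepares
late = acc.on_accept(5, 13)      # the old proposal now fails phase 2
```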


So, it is important at this stage to realize that phase 2 will only be successful if no higher numbered proposal has come in the middle. If a higher numbered proposal comes in the middle, it will set maxPrep = n', where n' is the number of that higher numbered proposal.

And thus, in phase 2, the current proposal will fail. If it does not fail, it essentially means that during this time period no other proposal with a higher number has come. This is a very crucial insight, which we will be using in the proofs. So, we set maxAccept to v in this case, which essentially means that v is a candidate for the consensus.

So, it is a two-phase algorithm. We then send the response to the proposer at the end of phase 2: all the acceptors that accept the proposal send a response back to the proposer saying that they have accepted the proposal.


Recall that we select a majority of acceptors and run the protocol with them. If we receive responses to all the accept messages that have been sent, then we are sure that we can choose the value v, which the proposer had proposed either on its own or because of the max of maxAccept line, and value v is the consensus value.

So, this terminates the algorithm. It is a very simple algorithm divided into two phases, and if the second phase completes successfully for all the acceptors who are part of the quorum, then we declare victory and say that value v has been chosen.

So, what the proposing node can then do is send a message to all the other nodes saying that consensus has been achieved and v is the consensus value. Now here is the fun part: even if the proposer just keeps quiet and uses the value v, all the later proposers are still going to observe value v to be the consensus value, and they are simply not going to observe any other value. So, this is a very intriguing result.
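Putting the two phases together gives a small single-threaded simulation of the algorithm as described in this lecture. It is a sketch under simplifying assumptions: messages are plain method calls, nothing is lost or reordered, and the names `Acceptor` and `propose` are hypothetical.

```python
class Acceptor:
    def __init__(self):
        self.max_prep = None
        self.max_accept = None

    def on_prepare(self, n):
        if self.max_prep is None or n > self.max_prep:
            self.max_prep = n
            return True, self.max_accept
        return False, None

    def on_accept(self, n, v):
        if self.max_prep is None or n >= self.max_prep:
            self.max_accept = v
            return True
        return False


def propose(n, own_value, quorum):
    """Run both phases against a majority; return the chosen value or None."""
    # Phase 1: send prepare(n) and collect the maxAccept replies.
    replies = [a.on_prepare(n) for a in quorum]
    if not all(ok for ok, _ in replies):
        return None
    seen = [v for _, v in replies if v is not None]
    v = max(seen) if seen else own_value
    # Phase 2: send accept(n, v); succeed only if every acceptor responds.
    if all([a.on_accept(n, v) for a in quorum]):
        return v
    return None


acceptors = [Acceptor() for _ in range(5)]
first = propose(1, 10, acceptors[:3])   # quorum {0, 1, 2}
second = propose(2, 99, acceptors[2:])  # quorum {2, 3, 4}, shares acceptor 2
```

The second proposer wanted 99, but the shared acceptor reports 10 in phase 1, so it ends up choosing 10 without any broadcast from the first proposer.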




(Refer Slide Time: 40:08) 


Algorithm

Paxos Algorithm - II

Algorithm 2: Paxos: Phase 2

Receive (maxAccept_i) from a majority of acceptors:
    v ← max(maxAccept_i)
    send accept(n, v) to all the acceptors in the quorum

Received accept(n, v) request:
    if n ≥ maxPrep then
        maxAccept ← v
        accept the proposal (n, v)
        send response to proposer
    end

Received all responses to accept messages:
    choose the value v




So, this is very non-intuitive. What it is trying to say is this: if I consider my entire system, I ran an algorithm, sent a few messages, and after that the state of the system has been set such that every subsequent proposer, if it proposes a value, is bound to choose v. The state of the system is not going to change; it has been set, and every subsequent proposer will find v to be the consensus value. So, the value does not really have to be broadcast.

So, after the protocol, if somebody wants to just come and learn the value, it can ask a few nodes if they are aware of the consensus value, and if they are aware they can send a response. We can always designate distinguished nodes for this purpose. Or, what the learner can do is simply act as a proposer and try to propose a new value.

If consensus has already been achieved, then the max of maxAccept line will not allow it to propose a new value; instead it will have to propose something which is already there in the system, and if the request goes through, which means both the phases pass, it will choose a consensus value which again will be v.


So, as I said, this is a pretty difficult thing to fathom and to visualize, but it is actually true that after both the phases, regardless of how many proposals come, all of them are bound to choose v as their consensus value. The crux of this algorithm was phase 1.

Phase 1 sets a temporary state in each acceptor, which is maxPrep: each proposal sets maxPrep to its proposal number. In phase 2, all that the acceptors do is check whether maxPrep has maintained the same value or not. If it has not, it means that there is an intruder in the middle, and then they abort. But for any sequence of messages where there is no intruder with a higher proposal number in the middle, the proposal goes through, and the value that it carries becomes the consensus value.

Once consensus is reached, it is reached for all time, and all values chosen will be the same. This is our claim, and we will proceed to prove it in the next few slides.
Analysis

Analysis of Paxos

Let us consider two proposals, P1 and P2.
@ If an acceptor (A) receives P2's prepare message after P1's
accept message, then P1 < P2 at A.
@ If an acceptor (A) receives P2's prepare message after P1's
prepare message is ignored, then P1 < P2 at A.
@ P1 is concurrent with P2 (P1 ⋈ P2) at A if P1 ≮ P2 and
P2 ≮ P1.


So, now the proof follows. Before we explain the proof, let us go through a few definitions. Consider two proposals P1 and P2, and let us look in some detail at what it means for an acceptor A to receive P2's prepare message after P1's accept message.

So, let us assume that there is an acceptor A, and let this be the timeline. Assume that P1's accept message comes at this point and then comes P2's prepare message. Then what is clear is that P1 has finished its phase 2 before P2's phase 1 begins. So, we can say that P1 < P2.

We can have another case at the common acceptor, where P1 sends its prepare message and the prepare message gets rejected because the n > maxPrep check fails. So, phase 1 of P1 fails, and then P2 sends a prepare message. Then clearly there is a strict ordering between P1 and P2, so here also we say that P1 < P2.

Similarly, we can have other cases where some P3 < P4. This is a standard precedence relationship at a certain acceptor. If P1 < P2 at all acceptors, we could say that P1 globally precedes P2, but we do not need to go that far. What we want to show now is that it is possible for P1 and P2 to be concurrent. If neither P1 < P2 nor P2 < P1 holds, then we say that P1 and P2 are concurrent, and we depict concurrency by this horizontal hourglass (⋈): P1 is concurrent with P2.


So, what would be a practical example of that? At an acceptor, we will have P1's phase 1, then P2's phase 1, and then P1 will try to initiate its phase 2. Here phase 1 is the prepare message and phase 2 is the accept message. Clearly, in this case there is no precedence relationship between P1 and P2: we cannot say that P1 < P2 or that P2 < P1. So, P1 and P2 are concurrent.

What we further say is that if two proposals are concurrent at any acceptor, they are just said to be concurrent. So, now let us go back to our slide presentation. We say P1 is concurrent with P2 at A if P1 and P2 do not have a precedence relationship between each other at A. If P1 and P2 are concurrent at any acceptor, then we say that P1 and P2 are globally concurrent; otherwise, the precedence relationship holds at all acceptors.
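These definitions can be checked mechanically on the message log of one acceptor. The log format below is an assumption made for illustration: a list of (proposal, event) pairs in arrival order, where the event is one of "prepare_ok", "prepare_ignored", or "accept".

```python
def precedes(log, p, q):
    """True if p < q at this acceptor, following the two rules above."""
    # Position of q's prepare message in the log, if it arrived at all.
    q_prep = next((i for i, (who, kind) in enumerate(log)
                   if who == q and kind.startswith("prepare")), None)
    if q_prep is None:
        return False
    # p < q if p's accept, or p's ignored prepare, came before q's prepare.
    return any(who == p and kind in ("accept", "prepare_ignored")
               for who, kind in log[:q_prep])


def concurrent(log, p, q):
    """True if neither proposal precedes the other at this acceptor."""
    return not precedes(log, p, q) and not precedes(log, q, p)


ordered = [("P1", "prepare_ok"), ("P1", "accept"), ("P2", "prepare_ok")]
interleaved = [("P1", "prepare_ok"), ("P2", "prepare_ok"), ("P1", "accept")]
```

The `interleaved` log is the practical example from the lecture: P1's phase 1, then P2's phase 1, then P1's phase 2.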
So, now we will prove a sequence of three theorems, which will essentially prove that Paxos successfully achieves consensus. Let us prove the first theorem. Recall that we were using the n field in the algorithm, but we are using num now; they mean the same thing, but this is easier to explain and visualize.

The theorem says that if P1 and P2 are concurrent and P1's number is less than P2's number, then the one with the lower number, which is P1, will not pass phase 2 at the common acceptor. So, it will not complete both the phases. This is easy to prove. There must be some acceptor that gets messages for both proposals.

Let us assume that that acceptor is A. We will consider two sub-cases. In the first, assume A first gets a prepare message from P1. Since the proposals are concurrent, it will get a prepare message for P2 after that.


Since P2.num > P1.num, the value of maxPrep at this point will be set equal to P2.num. Subsequently, P1's accept message will come. The first thing that the accept message checks is the value of maxPrep, and at this point phase 2 for P1 will fail, because maxPrep has already been set to P2.num and P2.num > P1.num. So, maxPrep > P1.num, and the check P1.num ≥ maxPrep will fail.

Because this check fails, P1 will not succeed in phase 2. Let us consider the other case, where we first get a prepare message from P2. This is even simpler: at this point maxPrep is set to P2.num. Now, P1 sends P1.num as a part of its prepare message. This will clearly fail phase 1, because in phase 1 we check if P1's number > maxPrep or not, which in this case it is not.

So, we see that in both the first case and the second case, P1 is not able to pass both the phases, which means that proposal P1 is not successful. So, what have we proven? We have proven that whenever two proposals are concurrent, the one with the lower number always fails. We cannot say anything about P2, since it might be concurrent with something else, but at least P1 will fail; that much we are sure of. So, we considered concurrent proposals in the last theorem.
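Both sub-cases of this theorem can be replayed on a single acceptor. This is an illustrative sketch with hypothetical names, using the same acceptor rules as the lecture (n > maxPrep in phase 1, n ≥ maxPrep in phase 2) and `None` as the null value.

```python
class Acceptor:
    def __init__(self):
        self.max_prep = None
        self.max_accept = None

    def on_prepare(self, n):
        if self.max_prep is None or n > self.max_prep:
            self.max_prep = n
            return True, self.max_accept
        return False, None

    def on_accept(self, n, v):
        if self.max_prep is None or n >= self.max_prep:
            self.max_accept = v
            return True
        return False


def p1_prepares_first(p1_num, p2_num):
    """Case 1: P1 prepare, P2 prepare, then P1 accept. Returns P1's phase 2 result."""
    a = Acceptor()
    a.on_prepare(p1_num)
    a.on_prepare(p2_num)           # raises maxPrep to P2.num
    return a.on_accept(p1_num, 1)  # P1.num >= maxPrep fails


def p2_prepares_first(p1_num, p2_num):
    """Case 2: P2 prepare, then P1 prepare. Returns P1's phase 1 result."""
    a = Acceptor()
    a.on_prepare(p2_num)
    ok, _ = a.on_prepare(p1_num)   # P1.num > maxPrep fails
    return ok


p1_phase2_ok = p1_prepares_first(p1_num=3, p2_num=7)
p1_phase1_ok = p2_prepares_first(p1_num=3, p2_num=7)
```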
Now let us look at two proposals where one precedes the other. If P1 < P2 and P2's number < P1's number, then P2 has to fail: P2 will not pass phase 1. This is very easy to prove. We will consider a common acceptor, and there has to be one acceptor in common, because both proposals send messages to a majority of acceptors. At that acceptor, since P1 < P2, P1's messages come first, so by the time P2's prepare message arrives maxPrep is at least P1.num.

Subsequently, when P2's prepare message comes, we do the same check with maxPrep, and this will fail because P2's number < P1's number ≤ maxPrep. So, the check in phase 1 will fail, and P2 will not pass phase 1.
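The same kind of replay illustrates this second theorem. Again a sketch with hypothetical names and made-up proposal numbers: P1 completes both phases at the common acceptor, and a later proposal passes phase 1 only if its number is higher.

```python
class Acceptor:
    def __init__(self):
        self.max_prep = None
        self.max_accept = None

    def on_prepare(self, n):
        if self.max_prep is None or n > self.max_prep:
            self.max_prep = n
            return True, self.max_accept
        return False, None

    def on_accept(self, n, v):
        if self.max_prep is None or n >= self.max_prep:
            self.max_accept = v
            return True
        return False


a = Acceptor()
a.on_prepare(9)                    # P1, number 9, phase 1
a.on_accept(9, 5)                  # P1, phase 2, value 5
lower_ok, _ = a.on_prepare(4)      # P2 with a lower number: fails phase 1
higher_ok, seen = a.on_prepare(12) # a later proposal with a higher number passes
```

The successful later proposal is handed P1's value in its phase 1 reply, which is the fact the third theorem builds on.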


So, what have we just proven? We have proven that if one proposal precedes the other, say P1 < P2, then for P2 to actually pass and be successful it needs to have a higher number. If I take the sum total of the last two theorems, they say something very important. What they essentially say is that for two concurrent proposals, the one with the lower number is going to fail, and if proposal P1 < proposal P2, then again, if P2 has a lower number, it is going to fail.

So, in a certain sense, this establishes a strictly increasing order of passed proposals: for proposals to pass both phases, they need to have strictly increasing numbers. If I were to draw a graph, it would be strictly increasing, and furthermore, if two proposals are concurrent, then the one with the lower number will fail for sure. So, if I am an external observer and I keep seeing the proposals that pass both the phases, I will see their numbers monotonically increase.
I will combine the last two results to prove the main result, which is theorem 3. This says that if P1 < P2 and both of them succeed, that is, both of them pass phase 2, then they have to choose the same value; this is the operative part. We have already established the monotonicity of the numbers of the proposals that pass both the phases; now we want to say that all of them choose the same value.

So, again we take P1 and P2 and find a common acceptor. For P2 to succeed, P2.num > P1.num, which again follows from the monotonicity. Now let us set up a mathematical induction. The induction hypothesis is that all the successful proposals with numbers between P1.num and P2.num choose the value chosen by P1. So, if P1 chooses v, all the intermediate proposals which pass both the phases also choose value v.


Now, what do we have to prove for the induction to be correct? We need to prove that P2 also chooses value v, and this has to follow from the induction hypothesis. So, let us see what the common node A did. At the end of phase 1, when the maxAccept values are being sent, the common node A must have forwarded its value to the proposer of P2. It has to do that, because for P2 to succeed a value has to be forwarded.

From the induction hypothesis, this value has to be v, because that is the chosen value. Why is this the case? You need to understand that A is the common acceptor between P1 and P2, and P1 chose v. So, P1 must have sent v to A in its accept message at the end of its phase 1, A would have recorded it and sent a response, and this successful response would have been recorded by P1.

So, the value v that P1 chose must have been there with A, and A must have forwarded this value at the end of phase 1 to the proposer of P2. This is an important argument. Now, what would P2 do at the end of phase 1? It will take a look at all the values that it gets. One of them is v, and if v is the maximum, then there is no problem: P2 will choose v and we are done.


So, what we need to prove is that it will never be the case that some value other than v is chosen. Let us just assume, for the sake of contradiction, that some value v' is chosen which is not v. Since v is among the values received and the choice is a max operation, we cannot choose any value that is less than v. So, whatever we choose has to satisfy v' > v.

So, in this case, let the proposal P3 be the one that proposed v'. By the induction hypothesis, all the proposals between P1 and P2, not including P2 of course, propose and choose v. So, there are two cases: either P3's number is less than P1's number, or P3's number is greater than P2's number.

If P3 appears before P1, then P1 must have seen P3's value, and P1 would not have chosen v because v' > v. So, this case is not possible. The other case that we are looking at, if I were to draw the same diagram, would be P1, a set of chosen proposals, then P2, and then P3.


So, now let us ask whether P2 and P3 have a precedence relationship. If you look at this, it is not really possible, because P2 is choosing the value proposed by P3. So, they cannot have a precedence relationship, and the only relationship that they are allowed to have is that both of them are concurrent.

Now, P2 and P3 are concurrent and P2's number < P3's number. So, by the first theorem it is patently clear that P2 cannot succeed. But we know that P2 has succeeded, so this case also cannot happen. Given that these two cases cannot happen, we have proven by contradiction that we cannot choose a value v' which is greater than v. This proves that the value that P2 chooses is v, and this completes the induction step.


So, let us outline the proof again. If P1 chooses v, and if we assume that all subsequent 
higher-numbered proposals that are totally successful also choose v, we need to prove that P2 also 
chooses v. So, we set up an induction, and then we show that the common acceptor must have 
proposed v. So, any value that P2 chose has to be greater than or equal to v. If it is v, no problem. 
If it is greater than v, then whoever was the original proposer of that value, which is v’, either had 
a number less than P1’s number or a number greater than P2’s number. Both cases are not possible; 
hence, P2 chose v. 


So, this establishes our induction hypothesis. Once proposal P1 is fully successful and the 
consensus value is v, our previous theorems tell us that no proposal with a number less than P1.num 
will ever pass both the phases. Any subsequent proposal whose number is greater than P1.num may 
of course pass both the phases, but the only value it is allowed to choose is v. 


Furthermore, if I consider the condition C2b, when I am computing the max operation, what we 
have just proved is that at least one of them will propose v, and there is no value in the system that 
exceeds v. 


So, all the values that will be proposed will be less than or equal to v, and one of them will be 
equal to v. So, ultimately the value that is proposed will be v, and the value that is accepted will 
also be v. So, we have been able to prove the safety property, which essentially says that consensus 
is going to hold. 
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So, we have not said anything about liveness which is the progress property. So, let us assume P1 
and P2 are concurrent, and we will show that there is a scenario where nobody will succeed. So, 
consider P1. P1 sends its prepare message, then P2 sends its prepare message; I assume P2.num 
> P1.num. Then P1 comes with its accept message. The accept message will not pass phase 2 
because of the intervening P2, so it fails. 


Then the proposer of P1 sends P3; P3’s prepare message has a greater number than P2’s prepare 
message. So, when P2 sends its accept, the accept again fails, and so on and so forth. Then the 
proposer of P2 sends P4 and P3 fails, then P4 fails, then P5 fails. So, it is possible that we have an 
infinite sequence of such failures. In the end, no proposal will succeed in both the phases, and 
because of this infinite chain of failures we will theoretically never achieve consensus. 
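
The interleaving described above can be sketched in a few lines of Python. This is only an illustrative model under simplifying assumptions: there is a single acceptor, values are ignored, and the names `Acceptor` and `duel` are hypothetical, not part of any Paxos implementation.

```python
class Acceptor:
    """Tracks only the highest prepare number it has promised."""

    def __init__(self):
        self.promised = 0

    def prepare(self, num):
        # Phase 1: promise to ignore anything numbered below num.
        if num > self.promised:
            self.promised = num
            return True
        return False

    def accept(self, num):
        # Phase 2 succeeds only if no higher-numbered prepare intervened.
        return num >= self.promised


def duel(rounds):
    """Each round, a new prepare arrives just before the previous
    proposal sends its accept. Returns how many accepts succeeded."""
    acceptor = Acceptor()
    next_num = 1
    pending = None  # proposal number still waiting to send its accept
    successes = 0
    for _ in range(rounds):
        acceptor.prepare(next_num)
        if pending is not None and acceptor.accept(pending):
            successes += 1
        pending = next_num
        next_num += 1
    return successes
```

Under this particular schedule every accept is pre-empted by a higher-numbered prepare, so `duel` returns 0 no matter how many rounds are run.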


So, this is clearly in line with the FLP result, which says that we will always find some sequence 
of actions where consensus is not achievable. But again, liveness was not our aim, because we 
knew that both safety and liveness are not simultaneously enforceable. One way out is, of course, 
to eliminate concurrent proposals: we can use a leader who alone is allowed to propose. Similar 
ideas have been taken up by later consensus protocols to simplify things. 


Of course, we still cannot find a way around the FLP result, so there is no point in trying, but at 
least Paxos provides safety. Whatever happens, Paxos still provides a certain degree of safety net. 
Furthermore, Paxos is complex. Why is Paxos complex? Because this was the protocol for 
accepting a single value. Let us say that we want to send a sequence of values, like a sequence of 
transactions, to different servers; then all the servers have to agree on the sequence of values. 


So, we will have to run multiple rounds of Paxos. This makes things very complicated, and that is 
where other consensus protocols come in, which are simpler than Paxos and provide more elegant 
solutions. Nevertheless, Paxos remains one of the most popular, most widely used, and most widely 
discussed consensus protocols, maybe the most popular. Any later consensus protocol uses Paxos 
as a gold standard for comparison. 




(Refer Slide Time: 68:02) 


Paxos 
Analysis 


* The Part-Time Parliament, Leslie Lamport. ACM TOCS, 1998. 


* Paxos Made Simple, Leslie Lamport. ACM SIGACT News 32.4 (2001): 18-25. 


So, the original Paxos paper was published in 1998; it was reasonably long and reasonably hard 
to understand. What we have presented in this lecture is Paxos Made Simple by the same author, 
Leslie Lamport, which was published in 2001. In subsequent lectures, we will discuss other kinds 
of consensus protocols that are simpler and that are specialized to certain application domains. 
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The Raft Consensus Protocol 


Prof. Smruti R. Sarangi 
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In this lecture, we will discuss the raft consensus protocol which is different from the paxos 
protocol. So, it is different in many senses. So, paxos is a far more complex protocol. So, raft is 


substantially simpler. 
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So, let us look at the main features of raft. So, the primary motivation of creating raft was that 
Paxos is perceived to be too complex. That is the reason why, when Paxos is taught, typically a 
simple version of the protocol is taught, which is called single-decree consensus; this essentially 
means that we are agreeing on a single value, not on a list of values. The problem is that most of 
the time we do not want to agree on a single value; rather, we want to agree on a series of values, 
such that in any distributed system different nodes will apply the same set of changes one after the 
other on their state machines. 


So, Paxos can do it; the full version of the Paxos protocol can do it. But since it is complex, a 
simpler protocol was devised, which is easy to understand, debug, and verify. The main advantages 
of the Raft consensus protocol are understandability (it is simple), the fact that it is naturally 
tailored towards a list of values and not just a single value, and ease of implementation. 
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The overview, so the key idea here is that we have a replicated state machine model which means 


that each server maintains a state machine. So, we essentially have multiple servers for a service, 
let us say providing a web page or checking email. Instead of one email server, we have multiple 
email servers, and each of them has exactly the same finite state machine. Clients send requests to 
the servers. What a dispatcher does is that it replicates the same request to all the servers, and all 
of them apply the client's command to their state machines. 
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And so since we can have a series of commands from different clients there is a need to create a 
list of commands that are seen by all the servers and there needs to be complete agreement a 
complete consensus on this list such that they can be applied in the same order otherwise we can 


have interesting violations in consistency. 


So, let us say, for example, there are two commands: the first command is to read an email and the 
second is to delete an email. If the read comes here first and then the delete, it is fine; we first read 
the contents and then we delete the email. But it is possible that another server, unless we do 
something, might see the delete first and then the read. That is an issue, because in this case the 
mail will be deleted first and the read will fail. This is not something that we want; hence, we want 
all the servers to see the same set of commands in the same order. 
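
The read/delete example can be made concrete with a small sketch. The mailbox representation and the function name `apply_commands` are assumptions made purely for illustration.

```python
def apply_commands(mailbox, commands):
    """Apply commands in order to a copy of the mailbox and
    return the results of the read commands."""
    state = dict(mailbox)
    results = []
    for op, mail_id in commands:
        if op == "read":
            results.append(state.get(mail_id))  # None if the mail is gone
        elif op == "delete":
            state.pop(mail_id, None)
    return results


mailbox = {"m1": "hello"}
# Server A sees the read first; server B sees the delete first.
server_a = apply_commands(mailbox, [("read", "m1"), ("delete", "m1")])
server_b = apply_commands(mailbox, [("delete", "m1"), ("read", "m1")])
```

Here `server_a` is `["hello"]` while `server_b` is `[None]`: the same two commands, applied in different orders, give the client different answers.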


So, this of course is a consensus problem, and it is a multi-decree problem instead of a 
single-decree one; it is like a list consensus, where multiple orders are given out, and this of course 
is hard to do in Paxos. That is the reason we would typically use Raft. We will also find, when we 
study blockchain, Bitcoin, and all of these technologies, that Raft is one of the key protocols behind 
the IBM-led Hyperledger Fabric, which is a very commonly used blockchain. We will have a later 
lecture on blockchains where we will discuss this in great detail, but the important point to keep in 
mind is that Raft is one of the key foundational technologies of Hyperledger-like blockchain 
systems. 


So, the idea here is that we first elect a leader. The leader then accepts the requests from the clients 
and replicates them at the servers, as we just saw. Then it informs the servers (not the clients) when 
they can process the messages in consensus order, such that all of them see the same set of 
messages in the same order. So, we divide time as follows: we have a leader, and the leader 
maintains its leadership for a period of time; then, if the leader crashes, we elect a new leader, and 
again the leader performs its role. Time is thus divided into what we call terms. 
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Let us now look at some of the key safety properties that we need to ensure in the raft algorithm. 
So, the first is election safety which means that at any point in time we have at the most one leader 
or in other words what it means is that in each term we will have at the most one leader, it will 


never be the case that we have two simultaneous leaders, two leaders at the same time. 


The second is the append-only property. It is a standard approach in distributed systems that we 
have an append-only log. So, the log is treated as something like an array that just keeps growing. 
Each of the log entries has an index, which is just an increasing sequence of numbers 1, 2, 3, 4, 5, 
6, 7, 8, 9, 10, and then we have a set of terms. Every entry that was added in a given term has that 
term ID. 


So, we can say that the first three entries are in term 1, the next four entries are in term 2, and so 
on. The leader, in particular, never overwrites or deletes any entry in its log; it only appends new 
entries. So, the leader's log only keeps growing. This is not true for followers, but at least for the 
leader this is the case. 


Then we come to two of the most important properties, which are log matching and leader 
completeness. Log matching concerns any two logs; the leader maintains a log, all the followers 
maintain a log, everybody maintains a log. We want that till some point the logs are exactly 
identical. This is the consensus that is a part of Raft. 


So, consider the point up to which the logs are identical; let us say that till point 7 the logs are 
identical. The moment we are sure of that, all of these entries can be passed on, or applied, to the 
state machines. What the log matching property says is that if two logs contain the same entry at a 
given index and term, then the logs are identical up to that index. 


So, let us say the entry with index 7 has the term T2 and the same contents in both logs. If that is 
the case, then the logs have to be identical till that index: if the entry at the seventh index is the 
same, all the entries before it, at indices 1 to 6, also need to be the same. This is the log matching 
property, and we will see that the algorithm explicitly ensures this. 


The other important property, which we actually need to prove at the end, is the leader 
completeness property, which concerns an entry that is committed in a given term. Let me first 
explain what it means to commit an entry. To commit an entry basically means to ensure that the 
entire log till that entry has been replicated. So, let us say we want to commit entry number 7: all 
the entries till 7 have been successfully replicated across all the servers, or a majority of the servers. 


At that point, we explicitly commit the entry, which basically means we tell all the servers that 
entries from 1 to 7 can be applied to their state machines. This process is called committing. So, 
the property says that if an entry is committed in a given term, it will be present in the logs of the 
leaders of all successive terms. This means that if the current leader crashes, a new leader is elected, 
and a new term begins, the new leader need not have the uncommitted entries of the previous 
leader, but it definitely has to have the committed entries of all the previous leaders. 


So, this ensures that the log grows in a certain way and that any change applied to a finite state 
machine is correctly applied. What would happen if we did not have this property? The leader of 
T2 would apply an entry, and then the leader of T3 would not find it. Then, from the point of view 
of leader 3, who owns term T3, this entry should not have been applied to the finite state machine, 
and that should never be the case. 


So, that would cause a correctness issue; hence, the leader completeness property is required. Then 
we come to state machine safety, which kind of follows from the rest. If a server has applied a log 
entry to its state machine, then all servers will apply the same entry at the same log index. This 
follows from the way we create the logs, the leader completeness property, and the way we commit 
the entries. 


So, the state machine safety property naturally follows from the first four, and we will not prove 
it separately. We will essentially look at the log matching and the leader completeness properties, 
because they are sufficient to ensure the rest, given the assumptions in the rest of the properties. 
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So, this is the way that a raft cluster looks like we have all the servers start in the follower state, 
then in the follower state once the servers timeout they start an election. So, every time an election 
is started we increment the term number. So, then the servers become candidates and they request 
for votes from the rest of the servers. So, what essentially happens is that one server which is a 


candidate it will send a vote request a message to the rest of the servers. 


So, if it gets votes from a majority of the servers, it is sure that it can be the leader. Otherwise, 
there is a split vote, which means nobody is able to garner a majority; in that case we increment 
the term number and start a new election. When a server is a candidate, it may discover that a 
leader has already been elected, which means the term has been incremented and there is a new 
leader. So, it relegates itself to the follower state, which is essentially this path. Furthermore, it is 
possible that a leader crashes and wakes up a long time later, just to find that the term has been 
incremented and a new leader has been elected. In this case also, the old leader realizes that it is 
not the leader anymore, so it goes back to being a follower. 


So, a Raft cluster typically has 5 servers or more, not less. The candidate state is a temporary state; 
follower and leader are stable states. Followers become candidates till a leader is elected, and then 
they become followers once again. If a leader finds another leader, of course at a higher term ID 
as we discussed over here, it becomes a follower once again. 
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So, we are dividing time into terms. So, a term begins with a leader election we have a new leader 
and, subsequently, normal operation commences. In normal operation, clients send messages to 
servers, the messages are replicated, and our consensus algorithm runs. Then it is possible that the 
leader crashes. 


So, what happens is that a new leader is elected. Subsequently, normal operation commences, and 
then again we have one more election once the leader for term 2 crashes. In term 3, it is possible 
that no leader could be elected because of split votes. In that case, we will have a timeout, we will 
increment the term number, and we will then elect one more leader. 


So, note that this is a probabilistic process; it is not guaranteed to terminate. We know that in an 
asynchronous system, because of the FLP result, it is possible that we will never achieve 
consensus. This means that any consensus protocol in an asynchronous system has to have a set of 
events where consensus is not possible, and this is one such case, where we will simply not be able 
to elect a leader. 


So, let us assume we have a period of split votes, and then we have a new leader and regular 
execution commences. Each server stores a current term number, and the term number is attached 
to every message; this is required such that everybody knows what the term is and who the leader 
is. It is possible that servers, be they followers, candidates, or leaders, crash and recover. 


So, they would use the term number to find out about the staleness of messages. If a server with a 
lower term number sends a message to a server with a higher term number, then clearly the server 
with the lower term number is not aware of the changes that happened while it was in the crashed 
state. 


So, in this case, the latter, which is the server with the higher term number, rejects (drops) the 
message. In the reverse case, if a server with a higher term number sends a message to a server 
with a lower term number, then the latter will just update its term. 


Furthermore, if a candidate or a leader discovers that its term is stale, which means that it is getting 
a message from another server with a higher term number, it will realize that maybe it crashed and 
then recovered, or there was some partition in the network, or something like that has happened. 
So, it will move to the follower state, as we described in the previous slide. 
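
The term comparison rules above can be summarized as a small sketch. The class and method names (`Server`, `on_message`) are hypothetical and chosen only for illustration; a real implementation would carry much more state.

```python
class Server:
    def __init__(self, term, role="follower"):
        self.term = term
        self.role = role  # "follower", "candidate", or "leader"

    def on_message(self, sender_term):
        """Return True if the message is processed, False if it is
        dropped because the sender's term is stale."""
        if sender_term < self.term:
            return False  # the sender is behind: reject the message
        if sender_term > self.term:
            self.term = sender_term  # we are behind: adopt the newer term
            self.role = "follower"   # a stale candidate or leader steps down
        return True
```

For instance, a leader at term 3 drops a message stamped with term 2, but on seeing a message stamped with term 5 it moves to term 5 and becomes a follower.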
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Now let us come to the details. So, all the servers as we just mentioned start in the follower state. 
They periodically get messages from the leader which are heartbeat messages, each heartbeat 
message definitely contains the term and the leader ID. So, they always know who the leader is 


and what is the current term and they use this to figure out if a message is stale or not. 


So, if a server does not get a heartbeat for a pre-specified duration this essentially means that 
maybe the leader has crashed. So, what it does is that it times out and it begins the process of 


electing a new leader it becomes a candidate itself and begins the process of electing a new leader. 
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So, now let us see what happens while beginning an election. So, for beginning an election the 
server increments its current term, so this is required. So, every election be it successful or 
unsuccessful has to happen with a new term and it transitions to the candidate state. It votes for 


itself and it sends a request vote message to the rest of the servers. 
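
The steps for beginning an election can be sketched as follows. This is a minimal illustration, assuming messages are plain dictionaries; the names `Node`, `start_election`, and the message fields are hypothetical.

```python
class Node:
    def __init__(self, node_id, peers):
        self.node_id = node_id
        self.peers = peers          # IDs of the other servers
        self.current_term = 0
        self.role = "follower"
        self.voted_for = None

    def start_election(self):
        """Called when the heartbeat timeout expires."""
        self.current_term += 1          # every election uses a new term
        self.role = "candidate"
        self.voted_for = self.node_id   # the candidate votes for itself
        # One request vote message per remaining server.
        return [
            {"type": "RequestVote", "term": self.current_term,
             "candidate": self.node_id, "to": peer}
            for peer in self.peers
        ]
```

Calling `start_election` a second time, for example after a split vote, increments the term again, which matches the rule that every election, successful or not, happens in a new term.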
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Let us look at the three possible outcomes and see what will happen in each of these cases. So, let 
us say it wins the election. If it wins the election, that is great; it means that a majority of the 
servers voted for it. Kindly note that in an election each server votes for only one candidate in a 
given term, and that is important. Double voting is not allowed, and this applies to the candidate 
as well: since it has voted for itself, it is not allowed to vote for anybody else in that term. 


The leader essentially needs to get votes from a majority of the servers. This is required, and this 
is where we can have a never-ending cycle. Then the leader begins its term and sends heartbeat 
messages to the rest of the servers. With the heartbeat message, the rest of the servers definitely 
get the new term ID and the leader ID. Now let us look at the second scenario. 


So, let us say that it either did not participate in the election or it did not win, but after that it gets 
either a heartbeat message or a regular append entries message from a server. If the term in the 
message is greater than or equal to the current term, it means that the other server is the leader. 


So, we recognize the other server as the leader, and the candidate transitions to the follower state. 
This is what we have been describing: just in case it is not able to win, or sometime in the middle 
of the election it gets a message with a higher term number, it should convince itself that it has 
lost the election; it will just recognize the new leader and transition to the follower state. 


If that is not the case, it is a stale message, and we just ignore it. Now let us consider the third 
case, where no leader is elected because we have split votes. In this case, the candidate will time 
out, mainly because it is not getting enough votes, and it will start a new round of elections. We 
can always argue that all the candidates start, they do not get enough votes, they time out, they 
start again, again they do not get enough votes, and so on and so forth. So, by this process we never 
get a leader; nobody ever wins the election. 


So, to solve that, the nodes have a randomized timeout, which means that for random periods the 
nodes kind of sleep and do not try to get themselves elected as leaders. Because of this randomized 
timeout, we minimize the contention; we minimize the overlaps between election attempts. 


So, if the randomized timeout function can separate the leader election periods, it is possible that 
we will have only a few servers, maybe just one or two, that are vying to be the leader at a given 
point in time. We can then assure one of them a majority, probabilistically of course. So, this 
minimizes the chances of having split votes. 
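
A randomized timeout can be sketched as below. The 150 to 300 ms range and the function names are assumptions made for illustration; the lecture does not specify a range.

```python
import random


def election_timeouts(server_ids, low_ms=150, high_ms=300, seed=None):
    """Give every server an independent random timeout in [low_ms, high_ms]."""
    rng = random.Random(seed)
    return {sid: rng.uniform(low_ms, high_ms) for sid in server_ids}


def first_candidate(timeouts):
    """The server whose timeout expires first starts its election first."""
    return min(timeouts, key=timeouts.get)
```

Because the timeouts are drawn independently, it is likely that one server times out noticeably earlier than the others, requests votes while the rest are still waiting, and collects a majority before a competing candidate appears.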
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Now let us consider the fact. So, now we have elected a leader we have taken care of the split votes 
issue. So, let us look at log replication: how is it that we achieve consensus in the stored logs? 
After a leader has been elected, the clients send it requests, the leader sends append entries 
messages to the rest of the servers, and these servers then append the entries. We will see in a 
second how it happens. 


So, let us look at the structure of a log. The log is maintained by the leader and by all the servers. 
It is a simple list, as we said, but each entry has an index and a term. So, this is the index, and of 
course we have a term, where multiple contiguous indices will have the same term, and then each 
entry stores a command. So, the important operative parts are the index, the term, and the command. 


Now, to commit an entry. A log entry is committed once the leader has replicated it on a majority 
of servers. The entry has to be replicated with the same index and term in the majority of the 
servers. So, in this case, entry number 4 has to be present with the same command in the majority 
of the servers; that is when we say that the entry is committed. And if the entry is committed, this 
commits all the preceding entries as well. 


So, all the entries before this are committed. Recall the log matching condition: if two logs match 
at one point, they match at all the previous points. So, because of log matching, we can be sure 
that this genuinely happens. 
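
The majority rule for committing can be sketched as a simple count. This is a simplification for illustration: each log is a list of (term, command) tuples, indices are 1-based as in the lecture, and the function name `is_committed` and the sample commands are hypothetical.

```python
def is_committed(index, term, logs):
    """True if a strict majority of the logs hold an entry with the
    given term at the given (1-based) index."""
    holders = sum(
        1 for log in logs
        if len(log) >= index and log[index - 1][0] == term
    )
    return holders * 2 > len(logs)


# Hypothetical leader log: three entries in term 1, one in term 2.
leader_log = [(1, "x=3"), (1, "y=1"), (1, "y=9"), (2, "x=2")]
# Five servers: three hold entry 4, one holds three entries, one holds two.
logs = [leader_log, leader_log[:4], leader_log[:4],
        leader_log[:3], leader_log[:2]]
entry_4_committed = is_committed(4, 2, logs)
```

With three of the five servers holding entry 4 in term 2, `entry_4_committed` is `True`; if only two servers held it, the count would fall short of a majority.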
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So, the leader then what the leader would do. So, the commit is associated with a state. So, the 
state here is the commit index. Suppose the leader commits entry 4. When will it commit entry 4? 
When the entry with index 4 is present in the logs of a majority of servers with the same index and 
the same term. To commit, the leader has a commit index, which is an internal variable. 


So, it just sets that to 4, which means that all the entries 1, 2, 3, and 4 are committed. Any message 
that the leader sends will include the highest committed index, which means it will include 4. So, 
any server that gets the message will know that the entries 1 to 4 are committed, and these entries 
can then be applied to its finite state machine. 


So, once the followers see the message, they commit their corresponding entries one after the other, 
in the order in which they are stored in the log. Let us say they have already applied the commands 
at indices 1 and 2; then they will now apply the commands at indices 3 and 4, but not the one at 
index 5. That is important. 
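
How a follower applies entries up to the commit index can be sketched as follows. The class name `Follower`, the `applied` list that stands in for the state machine, and the command strings are all assumptions made for illustration.

```python
class Follower:
    def __init__(self, log):
        self.log = log          # commands; index i is stored at log[i - 1]
        self.last_applied = 0   # highest index applied so far
        self.applied = []       # stands in for the state machine

    def on_commit_index(self, leader_commit):
        """Apply, in log order, every entry up to the leader's commit index."""
        limit = min(leader_commit, len(self.log))
        while self.last_applied < limit:
            self.last_applied += 1
            self.applied.append(self.log[self.last_applied - 1])
```

If the follower holds five entries and has already applied entries 1 and 2, a message carrying commit index 4 makes it apply entries 3 and 4 only; entry 5 stays in the log, unapplied, until a later message raises the commit index.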


(Refer Slide Time: 28:44) 


Log Matching Property - I 


* Key safety properties (Log matching property) 


* S1: If two entries in different logs have the same index and term, they store 
the same command. 


* S2: If two separate logs have the same index and term, all the preceding 
entries of the logs are identical. 


[Figure: logs of the leader and two followers over log indices 1 to 7; the first follower's log is 
matched, and the second follower's log is unmatched] 


So, let us now look at the log matching property. So, the key safety properties that we have 
discussed out of them the two important properties are the log matching property and the leader 
completeness property. So, the log matching property let us break it into two sub conditions S1 
and S2. So, if two entries in different logs have the same index and term they store the same 
command this is the first property. Second if two separate logs have the same index and term all 


the preceding entries of the logs are identical. So, let us take a look at these three logs. 
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So, here the color and this entry over here is the term number and the indices are 1, 2, 3, 4, 5, 6, 
and 7. So, the first row is the log of the leader, the second and third rows are the logs of the 
followers respectively. So, now what we get to see is that for the leader there are three terms. So, 


let us first consider the second row which is follower 1. 


So, in this case if we consider term 3 index 5, the entries are the same and so for the first property 
S1 if the index and the term match they have to have the same command which they actually do 
both are setting y to 5 and by S2 if two separate logs have the same index and term all the preceding 
entries of the logs are identical, which in this case holds true that all the preceding entries which 


are entries 1, 2, 3, and 4 are identical. 


Now let us consider the log for the second follower. In this case, the first 4 entries are identical, 
but the fifth entry, at index 5, has a different term number. For index 5 the term number should be 
3, but it is 2, so this entry is unmatched. Given the fact that this entry is unmatched, we will say 
that this log is an unmatched log, and as per the log matching property this log is not acceptable to 
us. So, this is an unmatched state, whereas all the logs match for the first 4 entries. 
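
The comparison of the three logs can be sketched as a prefix check. The logs below are hypothetical: only the facts stated in the lecture (term 3 and the command setting y to 5 at index 5 of the leader, and term 2 at index 5 of the second follower) are taken from the example, and the remaining terms and commands are made up for illustration.

```python
def matched_prefix(leader_log, follower_log):
    """Number of leading entries on which both logs agree.
    Each entry is a (term, command) tuple."""
    n = 0
    for a, b in zip(leader_log, follower_log):
        if a != b:
            break
        n += 1
    return n


leader = [(1, "x=3"), (1, "y=1"), (1, "y=9"), (2, "x=2"),
          (3, "y=5"), (3, "x=0"), (3, "y=7")]
follower_1 = leader[:5]                # agrees with the leader up to index 5
follower_2 = leader[:4] + [(2, "x=4")]  # wrong term at index 5
```

Here `matched_prefix(leader, follower_1)` is 5, while `matched_prefix(leader, follower_2)` is 4, because the entry at index 5 of the second follower carries term 2 instead of term 3.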
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So, to ensure property S1 that the same index term tuple leads to the same command, what the 
leader does is that it creates at most one entry at a given index in a certain term. This basically 
means that the log is treated as a monotonically increasing array: within a given term, the index 
keeps on increasing. So, essentially, if the log is full till, let us say, the nth index, the leader would 
add an entry at the (n + 1)th index and also attach its term number along with it. 


And given the fact that this entry is being replicated, all the replicated logs would also have the 
same command at the same index; otherwise, the log matching property will not hold, and we want 
it to hold. The way that we do this is that the leader adds the entry to its own log and then broadcasts 
that entry in an append entries message with the index, and all the followers have to insert it at 
that index only. So, that ensures property S1. 


Next, let us consider property S2: if two logs have the same term at the same index, all the previous 
entries are going to match. To ensure this, along with an append entries message, the leader sends 
the index and term of the previous entry in its log. What would that translate to here? Let us say 
this is the current entry, for which the leader is adding a command, and this is the previous entry. 


So, the leader would create an append entries message, and in that message it would add the 
command with its index and term; it will also send the index and the term of the previous entry. 
The advantage of this is that if the follower does not find the previous entry in its log with a 
matching index and term, it refuses to accept the message, and if it accepts the message, it means 
that the previous entry matches, so the current entry will also match. So, this ensures the log 
matching property, essentially by induction. The base case is that the logs are empty. 


So, initially the logs are the same across all the servers. The induction step is that we add a new 
entry only if the previous entry matches; we can check this because the previous entry's index and 
term are sent along with every append entries message. Otherwise, the follower refuses to accept 
the message. We will see what happens in this case; this refusal case has to be understood. But if 
the follower does accept the message, then the logs match. 
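
The follower's check on the previous entry can be sketched as below. This is a simplified illustration: the function name `append_entries` and its parameters are hypothetical, logs are lists of (term, command) tuples with 1-based indices, and on acceptance the follower simply discards anything it held from the new entry's index onwards.

```python
def append_entries(log, prev_index, prev_term, entry):
    """Follower-side handling of one append entries message.
    Returns True if the entry was accepted, False if refused."""
    if prev_index > 0:
        # Refuse unless the previous entry exists with a matching term.
        if len(log) < prev_index or log[prev_index - 1][0] != prev_term:
            return False
    # Place the new entry at index prev_index + 1.
    del log[prev_index:]
    log.append(entry)
    return True
```

The base case of the induction corresponds to `prev_index == 0`: with an empty log there is no previous entry to check, so the first entry is always accepted. Every later entry is accepted only if the entry just before it matches.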


So, essentially our protocol is designed in such a way that properties S1 and S2, which are 
sub-properties of the log matching property, continue to hold if a message is accepted. It is possible 
that because of crashes the followers' logs will diverge. So, if this is the leader's log and this is the 
follower's log, they might match till this point, and after that there might be a divergence. The 
divergence has to be fixed. 


We saw one example of divergence over here, which was not an acceptable entry. The way that 
Raft fixes the divergence is that it forces followers to replicate the leader's log. The leader is 
assumed to be always correct, and the followers simply replicate the leader's log, so this eliminates 
the divergence. 
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Reconciling the Log Entries 
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* The leader maintains a nextindex pointer for each follower 


* It is initialized to be equal to the index of the last entry in its log + 1 
[Assuming the logs are consistent] 


* Followers might indicate a divergence after receiving a message from the leader. The 
entries at (nextindex - 1) do not match. 


* The leader decrements the nextindex pointer and tries again 


* Ultimately the logs match. The follower appends all the remaining entries from the 
leader’s log. 


* The leader never overwrites or deletes entries in its own log. It only appends.
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Now let us try to reconcile the log entries. So, consider a log over here with 3 terms and 7 indices. 
The leader maintains a next index pointer for each follower. What this essentially means is that it tracks how far the log of a follower matches the log of the leader. So, let me explain. The next index pointer is initialized to be equal to the index of the last entry in the leader's log + 1.


So, if this is the leader's log over here with 7 entries, then the last entry + 1 = 8. So, it is assumed by the leader that all the followers have matching logs and that the next index at which they will insert an entry is index number 8. Now, followers might indicate a divergence after receiving a message from the leader.

So, why would that be? That would be the case when the follower refuses to accept the message from the leader because the previous entry does not match. In this case, the leader becomes aware that there is a divergence at the given follower. So, the leader would send entry number 8, but the follower would say that, look, entry number 7 does not match.


So, the leader in this case will decrement the next index pointer and try again. Ultimately, the logs will match at some point. So, let us say they match till index 4. So, one thing that the leader is sure of is that the first 4 entries of the follower match; after that, of course, things diverge. Once the logs match, what the follower will do is replace its own conflicting entries and append all the remaining entries from the leader's log, which are these three entries 5, 6 and 7.
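
The back-off loop described above can be simulated with a short Python sketch. It is only an illustration under stated assumptions: each log is reduced to a list of terms (commands are omitted), indices are 1-based, and the function names `follower_accepts` and `reconcile` are hypothetical.

```python
def follower_accepts(follower_log, prev_index, prev_term):
    """True if the follower holds an entry with prev_term at 1-based prev_index."""
    if prev_index == 0:
        return True  # an empty prefix always matches
    return prev_index <= len(follower_log) and follower_log[prev_index - 1] == prev_term


def reconcile(leader_log, follower_log):
    """Returns the repaired follower log and the number of attempts the leader needed."""
    next_index = len(leader_log) + 1  # optimistic initial guess
    attempts = 0
    while True:
        attempts += 1
        prev_index = next_index - 1
        prev_term = leader_log[prev_index - 1] if prev_index > 0 else 0
        if follower_accepts(follower_log, prev_index, prev_term):
            break
        next_index -= 1  # refusal: step back one entry and retry
    # The follower keeps the matching prefix and copies the rest from the leader.
    repaired = follower_log[:prev_index] + leader_log[prev_index:]
    return repaired, attempts
```

With a 7-entry leader log `[1, 1, 2, 2, 3, 3, 3]` and a follower log `[1, 1, 2, 2, 2, 2]` that matches only up to index 4, the leader is refused at next index 8, 7 and 6, succeeds at 5, and the follower ends up with the leader's log.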


So, this ensures that the follower and the leader will ultimately come in sync. The way that this happens is that the follower first has to indicate a divergence by refusing messages. The leader then continues to decrement the next index pointer until the logs match, and after that the follower just copies the rest of the log entries from the leader. So, the leader never overwrites and never deletes entries in its own log; it only continues to append them.

Leader Completeness Property

* Leaders will keep changing because of process crashes
* However, the new leader should have all the log entries of the old leader

* Election restriction
* A candidate cannot win an election unless its log contains all committed entries
* The <RequestVote> message includes information about the candidate's log
* The candidate's log should at least be as up to date as the log of the voter.

* Up to date check
* Check the last entries. The log with the higher term is more up to date.
* If the terms are the same, the log with more entries is more up to date.


So, given that we have discussed this, let us discuss more of the safety properties. We have already discussed log matching, and we have already discussed how the logs are reconciled if we discover a divergence. So, next let us discuss the leader completeness property. The leader completeness property is like this: the leaders will keep changing because of process crashes.

So, any kind of a consensus algorithm does take crashes into account, and the leaders will keep changing. However, the new leader, whoever is elected, is expected to have all the log entries of the old leader. To be precise, it should have all the committed log entries of the old leader. If that is not the case, there is a problem.


So, we add an election restriction, which is that a candidate cannot win an election unless its log contains all the committed entries. So, what is the idea? The leader is over here, and it keeps on sending messages to each of the servers. Once the leader is dead, one of the servers needs to be elected as the leader, but a server can win the election only if it has all the committed entries of the old leader.


So, this is a crucial election restriction that we put to ensure that servers whose logs are not up to date do not win the election. The RequestVote message includes information about the candidate's log, in particular the index and the term of the last entry in its log. So, the candidate's log should be at least as up to date as the log of the voter.


So, how did the leader commit an entry? The leader committed an entry because it was there with a majority of the servers. So, whenever, let us say, this server becomes a candidate, it sends a RequestVote message to the rest. They will check whether their own log is more up to date than the candidate's log or not. The candidate's log should be at least as up to date as the voter's log; only then will the voter grant its vote.

So, the basic insight is that we want to ensure that whatever the old leader has committed is present in the log of the next successful candidate. Since leaders only come out of candidates, we are ensuring the leader completeness property via this.


So, what is the up to date check? Well, we check the last entries of the two logs. The log whose last entry has the higher term is more up to date. That is common sense: a higher term means that the log has seen entries from a more recent leader. And if the terms are the same, the log with more entries is more up to date.

So, we will connect this with the number of committed entries later in the proof, but this is our up to date check: we compare the last entries of the logs. The winner is the one with the higher term, and if the terms are the same, then the log with more entries is more up to date.
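
The up to date check is small enough to write down directly. The following Python sketch is an illustration only: each log is summarized by the pair (term of last entry, index of last entry), an empty log is represented as (0, 0), and the function name `at_least_as_up_to_date` is hypothetical.

```python
def at_least_as_up_to_date(candidate_last, voter_last):
    """Each argument is (last_term, last_index) of a log; an empty log is (0, 0)."""
    cand_term, cand_index = candidate_last
    voter_term, voter_index = voter_last
    if cand_term != voter_term:
        return cand_term > voter_term  # the higher last term wins
    return cand_index >= voter_index   # same term: the longer (or equal) log wins
```

A voter would grant its vote only if this function returns True for the candidate's summary against its own.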
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Committing Entries from Previous Terms 


* Assume a leader crashes 


* Let's say it crashes before committing an entry, e, that is stored in a majority of servers.

* A new leader will be elected
* This leader's log has to be at least as up to date as a majority of the servers
* Let us say it has e in its log

* In the normal course of operation it will send e to the rest of the servers that do not have it.

* What about an entry from a previous term that is uncommitted?
* Should the current leader overwrite it, or commit it?
* Prefer simplicity: Do not commit entries from previous terms.
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We can have several corner cases when it comes to leaders crashing and then recovering. So, let 
us consider a few such cases. There is an example in the paper. The viewers should be convinced of the fact that most consensus protocols are very hard to design. Particularly, if the leader crashes, then there are many race conditions and many corner cases which need to be handled.

It is typically not possible to manually verify all of them. That is the reason many of these protocols are verified by automated verification engines, which go through hundreds of these corner cases and verify that for each of them the protocol works correctly.


So, let us give a brief overview of what can happen. Assume a leader crashes, and let us say it crashes before committing an entry e. So, what does committing an entry mean here? This is not just a single phase; it is actually a pretty loaded term. What the leader does is that it sends an AppendEntries message to all the servers. Only when a server accepts the message is the leader sure that the server has added the entry to its log. When it gets an acknowledgement from a majority of the servers, it is sure that a majority of the servers have the entry in their logs.

At that point, the leader updates its last commit index, and every subsequent message carries the value of the last commit index. So, once the last commit index reaches a server, that server is also sure that the leader has committed the given entry. This process is kind of implicit in Raft: it is not that the leader commits an entry and announces it separately. This is a rather complex aspect of the protocol, which is hard to appreciate, but there is an example and a section in the paper. I would request the viewers to take a look at it.


So, let me give an overview of what exactly it says. Let us say the leader crashes before it commits an entry, which means before it has gotten acknowledgements from a majority of the servers that the entry has been added. In this case, a new leader is elected. This leader's log clearly has to be at least as up to date as the logs of a majority of the servers, where up to date is defined the way we have defined it: either the last entry has a higher term, or the last entry has the same term and the leader's log is at least as long as the voter's log.


So, let us say the new leader has the entry e in its log. Since the new leader does not delete or overwrite any entries in its own log, in the normal course of operation it will send the entry e to the rest of the servers that do not have it. So, the rest of the servers will get the entry. This is one of the cases that is handled safely. But in general, what about an entry from a previous term that is uncommitted?

So, for the sake of simplicity, what Raft does is that it does not explicitly commit it. If the leader has the entry in its log, it will naturally propagate to the rest of the servers, and it gets committed indirectly once a later entry from the leader's own term is committed. If the leader does not have the entry, nothing special is done for it. This is where simplicity is preferred: we do not automatically commit entries from previous terms. If it does happen in the course of this leader's operation, it is fine; otherwise, it does not happen.
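
One way to picture this rule is as a condition on how the leader advances its commit index. The sketch below follows the rule stated in the Raft paper, namely that a leader counts replicas only for entries of its current term. The function name, the argument layout and the use of a plain list of terms for the log are assumptions made for this illustration.

```python
def advance_commit_index(log_terms, match_index, commit_index, current_term):
    """log_terms[i - 1] is the term of entry i in the leader's log.

    match_index lists, for every server including the leader, the highest
    index known to be replicated there. Returns the new commit index.
    """
    majority = len(match_index) // 2 + 1
    for n in range(len(log_terms), commit_index, -1):
        replicated = sum(1 for m in match_index if m >= n)
        if replicated >= majority and log_terms[n - 1] == current_term:
            return n  # all entries before n are committed along with it
    return commit_index
```

In this sketch an entry from an older term that already sits on a majority does not move the commit index by itself. It becomes committed only when an entry of the current term after it reaches a majority.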
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Follower and Candidate Crashes 


* Raft keeps trying to send <RequestVote> and <AppendEntries> to all 
crashed followers and candidates 


* All of Raft’s messages are idempotent 


* There is no harm if multiple copies of the same message are sent to the same 
server. 


* Timing requirements:
* Broadcast time ≪ Election timeout ≪ Mean time between failures
* Broadcast time: 0.5 to 20 ms
* Election timeout: 10 ms to 500 ms


If the follower or the candidate crashes, well, that is an easy situation to manage. Raft keeps trying to send RequestVote and AppendEntries messages to all the crashed followers and candidates. And all of Raft's messages are idempotent. What that means is that there is no harm if multiple copies of the same message are sent to the same server. So, the same RequestVote or AppendEntries message can be sent any number of times and there is no issue.

For example, if the AppendEntries message is sent twice or thrice, it does not matter: if the entry has been added to the log once, it will not be added a second time. Of course, there are some timing requirements. The time it takes to broadcast a message has to be much less than the election timeout; that is very clear.


Because if the election times out before the broadcast completes, then we will just be doing broadcast after broadcast. And the election timeout has to be much lower than the mean time between failures, because if that is not the case, then servers will fail too often relative to the timeout and the election will never conclude. Some typical values that are there in the paper are that the broadcast time is between 0.5 and 20 milliseconds and the election timeout is between 10 milliseconds and 500 milliseconds.
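
As a rough sanity check on a configuration, the inequality can be expressed as a small Python function. This is only a sketch: treating "much less than" as "smaller by at least a factor of 10" is an assumption made here for illustration, and the name `timing_ok` and the one-month failure interval in the usage note are hypothetical.

```python
def timing_ok(broadcast_ms, election_timeout_ms, mtbf_ms, factor=10):
    """Checks broadcast time << election timeout << mean time between failures.

    '<<' is approximated as 'smaller by at least `factor`', an assumed threshold.
    """
    return (broadcast_ms * factor <= election_timeout_ms
            and election_timeout_ms * factor <= mtbf_ms)
```

For example, a broadcast time of 20 ms with an election timeout of 500 ms passes this check if servers fail about once a month, whereas an election timeout of 10 ms with the same broadcast time does not.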
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Proof of Safety (Leader Completeness Property) 


* Say that in term T, leader T commits an entry, e

* At a later term U, leader U does not store the entry

* Consider the smallest such U, U > T
* e must have been absent from leader U's log at the time of its election

* There must be some server, S, that accepted e (sent by leader T) and also voted for leader U.

* S still had e when it voted for leader U. This is because all the intervening leaders had e in their logs (assumption, we chose the smallest U).

* At the time of voting, leader U's log must have been up to date

* If they had the same last term, then the log of leader U must have been longer. It must have contained e.
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So, now let us provide a quick proof of safety of the leader completeness property. So, we have 
already described the log matching property. That one is easy: it follows by induction, which is the way that the two sub-properties S1 and S2 are handled. So, let us now discuss how we would prove the leader completeness property in the next two or three slides. Let us say that in term T, the leader of term T, leader T, commits an entry e.

So, at a later term U, leader U does not have the entry. This would violate the leader completeness property, and this is where we want to derive a contradiction: we want to say that this situation will not happen. So, let us consider the smallest such term U, where U > T, for which the entry e is absent from the log of leader U.


So, this will only happen if e was absent from leader U's log at the time of its election. When U was elected, it did not have the entry e, and even after the election it does not have the entry e, because the process of election per se does not add an entry to the log. This means that there must be some server S that accepted e, which was sent by leader T, and that also voted for leader U.

Why? Because a commit requires a majority and an election requires a majority. So, the two majorities must have a non-empty intersection; let some server S be in that intersection. So, S must have accepted e as well as voted for leader U. Since we have taken the smallest such U, S still had e when it voted for leader U. This is because all the intervening leaders had e in their logs, which is by assumption.
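
The intersection argument can be checked by brute force for small clusters. The following Python sketch enumerates every majority of a set of servers and confirms that any two of them share at least one member. The function names `majorities` and `all_majorities_intersect` are hypothetical and the code is meant as an illustration, not as part of the protocol.

```python
from itertools import combinations


def majorities(servers):
    """All subsets of servers that contain a strict majority."""
    servers = list(servers)
    need = len(servers) // 2 + 1
    return [set(c) for k in range(need, len(servers) + 1)
            for c in combinations(servers, k)]


def all_majorities_intersect(servers):
    """True if every pair of majorities has at least one server in common."""
    quorums = majorities(servers)
    return all(a & b for a in quorums for b in quorums)
```

The reason is simple counting: two sets that each contain more than half of the servers cannot be disjoint.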


So, the server S still had e; it was not removed. At the time of voting, leader U's log must have been at least as up to date as the log of S, because it passed the up to date check. If they had the same last term, then the log of leader U must have been at least as long as the log of S. So, it must have contained every entry in the log of S, including e. This contradicts our assumption, so this cannot be the case.

Proof of Safety - II

* Otherwise, leader U's last log term must have been larger than the voter's.

* The earlier leader that created leader U's last log entry must have had e in its log (assumption). By the Log Matching property, leader U's log must also contain e.
* Hence, the Leader Completeness Property holds.

So, if that is not the case, then only the other case is possible, which is that leader U's last log term must have been larger than the voter's, where the voter in this case is S. So, the term of leader U's last entry must have been larger than the term of the last entry in the voter's log. So, let us see, is this possible? This is not possible, and the reason is like this: the earlier leader that created leader U's last log entry must have had e in its log; that is by assumption.


So, by the log matching property, leader U's log must also contain e. It is important that we go through this argument once again, so let us explain it once again. If you would recall, the leader completeness property says that if a given entry has been committed by a previous leader, it is there in the logs of all successive leaders. So, we say let this not be the case, and let us derive a contradiction. So, consider some later term U whose leader is leader U, where U > T, and let it not have the given entry.


If it does not have the given entry, it would not have had that entry at the time of voting as well. This means that there must have been some server S that accepted e and also voted for leader U. So, for this server S, let us try to see what would have happened and derive a contradiction. What would have happened is that the server S would have still had e at the time of voting, and leader U's log would have been at least as up to date.

Suppose that leader U and S have the same last term in their logs. Then it must be the case that leader U just has the longer log, which means it has greater than or equal to 0 extra entries. This means that, from the log matching property, if e is somewhere in the log of S, e has to be in the same position in the log of leader U. So, there is no other option.


So, since there is no other option, this case is precluded; it could not have led to the divergence. What could have led to the divergence is that leader U's last term was larger than the voter's last term. So, if I were to draw the logs once again: if the last term of leader U's log is T1 and the last term of the voter's log is T2, then T1 > T2. So, this would have made leader U's log more up to date.


So, let us understand what would have happened. Consider the earlier leader that created leader U's last log entry. So, let this be the last log entry of leader U. What would have happened? Well, the earlier leader would have had e in its log. That is by assumption, because we are assuming the minimum U.

So, basically, this is where the last log entry ends; then the earlier leader crashes, and this is where leader U becomes the leader and its term starts. The earlier leader who created this entry must have had e in its log, and leader U took it from there. Because of the log matching property, given that the two logs match at this entry, all the previous entries would have matched as well, which we have already ensured by induction. This means that at the time of voting, leader U would have had e in its log.


So, since leader U will subsequently not delete an entry, it will continue to have e. Hence, we have a contradiction in both the cases, and thus the leader completeness property holds. It is important that the viewer goes through this argument several times and appreciates all of its nuances. It is reasonably complicated to visualize and explain, which is why the viewer needs to go through the proof in the paper several times.
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Cluster Membership Changes 


* Servers can get added or deleted from the Raft cluster.
* The traditional approach is to stall the system.
* Copy the logs to the new configuration.
* Restart the system.

* Raft does this without any down time.
* Leader receives a request to change the configuration: C_old → C_new
* It creates a joint consensus mechanism
* Creates a new configuration C_old,new. It broadcasts this message to all the servers.
* Once the C_old,new entry has been committed, all the servers have to respect this joint configuration.
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A few miscellaneous issues. So, the first is that servers can get added or deleted from the raft 
cluster. A traditional approach would be to stall the entire system, copy all the logs from the old configuration to the new configuration, and restart the system. Raft does this without any downtime. What Raft does is that if the set of servers is supposed to change, it creates something called a joint consensus mechanism, which is kind of like having two configurations active at the same time, working in unison.


So, what happens is that the old leader receives a request to change the configuration from C_old to C_new. So, the number of servers will change. Then a joint configuration called C_old,new is created, and this configuration is broadcast to all the servers. Once the C_old,new entry has been committed, all the servers have to respect this joint configuration. So, if all the servers respect this joint configuration, then we go forward.

Joint Consensus Mechanism

* During this period, the servers act in a special way
* Log entries are replicated in all the servers.
* Any server (from C_new or C_old) can be elected as a leader

* We need separate majorities (election and entry commitment) in both the old and new configurations.

* This ensures that the new servers in C_new can get all the logs. New servers can also join as non-voting members.

* After this the leader sends a message to all the servers describing C_new
* A server from C_new is expected to start an election
* This can happen because all the logs till this message are committed.
So, what will happen in this case? During this period, the servers act in a special way. We first start with the C_old configuration, then we go to C_old,new, which is the joint configuration, and then we go to C_new, which is the new configuration. During this period, all the log entries are replicated in all the servers. Any server from C_new or C_old can be elected as a leader during this transition period.


So, for any commit we need separate majorities in both the old and new configurations, which basically means that for committing any log entry we need a majority in C_old as well as a majority in C_new. This ensures that the new servers in C_new get to see all the logs, and their logs get up to date over time such that they can ultimately vote and win elections. Initially, they can be non-voting members, where they will just be reading and recording the log entries. Gradually, when their logs get up to date, they will become voting members. After this, the leader sends a message to all the servers with C_new.


So, after this message is committed, a server from C_new is expected to start an election and win it. This will happen if we stop the servers that are only in C_old from participating in the election. The reason it will win is that all the log entries till this message are expected to be committed.

So, that is the reason the candidate will be eligible: its log will be at least as up to date as the logs of the rest of the servers. Then the new configuration will take over, and all the servers in C_old minus C_new, which are essentially the servers that are not a part of the new configuration, can be shut off.
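
The "separate majorities" requirement of the joint configuration can be written as a small predicate. The Python sketch below is an illustration only: configurations and acknowledgements are modelled as sets of server IDs, and the function names `has_majority` and `joint_majority` are hypothetical.

```python
def has_majority(acks, config):
    """True if more than half of the servers in config are in acks."""
    return len(acks & config) > len(config) // 2


def joint_majority(acks, c_old, c_new):
    """During C_old,new an entry (or a vote) needs a majority in both configurations."""
    return has_majority(acks, c_old) and has_majority(acks, c_new)
```

For example, with C_old = {1, 2, 3} and C_new = {3, 4, 5}, acknowledgements from {1, 2, 3} are not enough, because only one server of C_new has responded, whereas acknowledgements from {1, 3, 4} satisfy both majorities.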
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Log Compaction 


* Logs will keep growing over time. 


* When a log reaches a fixed size, take a snapshot.
* Store a snapshot of the entire server in stable storage (e.g., a hard disk)
* Record the index and term of the latest entry in the last snapshot
* Then discard the stored logs from the servers


* Snapshots are taken independently by the leader and followers
* The leader may send snapshots to followers that are far behind
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The other aspect is log compaction which means that logs will keep growing over time. So, when 
a log reaches a fixed size, what we can do is that we need not store the entire log; we can take a snapshot. A snapshot is a copy of the state of the server up to that point in the log, which can be stored in stable storage like a hard disk. We record the index and the term of the latest entry covered by the snapshot, and we discard those log entries from the server.


So, servers can take snapshots independently. Furthermore, the snapshot can be used for another good purpose. Assume a follower is running far behind the leader. In this case, what the leader can do is simply send it a snapshot. The follower can install the snapshot, then read all the log entries that it does not have and add them to its log.
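
A snapshot and its use by a lagging follower can be sketched as follows. This is a simplified illustration that assumes the server state is a key-value dictionary and that each log entry is a (term, key, value) triple. The function names and the dictionary fields `last_included_index` and `last_included_term` are chosen for this sketch and are not taken from the lecture.

```python
def take_snapshot(state, log, last_applied):
    """state is the result of applying log[:last_applied]; indices are 1-based.

    Returns (snapshot, remaining_log). The snapshot records the index and term
    of the last entry it covers, so the covered entries can be discarded.
    """
    snapshot = {
        "last_included_index": last_applied,
        "last_included_term": log[last_applied - 1][0] if last_applied else 0,
        "state": dict(state),  # the copy that would be written to stable storage
    }
    return snapshot, log[last_applied:]


def install_snapshot(snapshot, newer_entries):
    """A lagging follower restores the state, then replays the entries after it."""
    state = dict(snapshot["state"])
    for _term, key, value in newer_entries:
        state[key] = value
    return state
```

In this sketch the follower never needs the discarded prefix of the log, because the snapshot already captures its effect.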
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Client Interaction | 

* Clients first contact a randomly chosen server, that directs them towards the leader

* Linearizability consistency condition:
* Each request appears to execute in a single instant at some point between its invocation and response
* Every command is assigned a unique serial number by the client. Servers keep a record of it. If they see a command once again, they simply respond to it without re-executing the request.

* Read-only operations
* Can be handled without writing anything to the log
* Before executing a read-only operation, the leader needs to ensure that it has at least committed a single message in the current term, and it is still the leader.


So, let us now discuss the way that the clients interact with the Raft system. The client first contacts a randomly chosen server. The randomly chosen server returns the ID of the leader, and the client then sends its command to the leader. The command is sent to the state machine: either it changes the state, which would be a write operation, or it reads the current state, which is a read operation.


So, all distributed systems need a notion of correctness. In this case, we use the linearizability correctness condition, which is stronger than sequential consistency. Here the idea is that a command appears to execute instantaneously at some point between its request and its response. Most distributed systems try to provide linearizability to some extent, and Raft does provide linearizability.


So, to ensure this, we need some idempotency of the commands. Of course, a read can be issued several times, but we will see that it has its own complications. But let us say we send a write and the leader crashes, so we do not get a response, and we send the write once again. Then it should not be the case that this actually becomes a second write.
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So, let us say that we were implementing a key value store and we set x = 3 but the client is not 
sure if it went through or not. Then another client sets x = 5. After that, the previous request is retried and we have x = 3 again. So, this is not a linearizable execution: the first write does not appear to happen at one instant.


So, what we do is that the client assigns a unique serial number to every command. The servers keep a record of the serial numbers, and if they see a command once again, they simply respond to it without re-executing the request. So, if the request has actually gone through, the serial number would be there with the servers and they would quickly return the response.
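
The x = 3, x = 5 scenario can be replayed with a toy key-value server that remembers serial numbers. The class below is a minimal single-server sketch and not a Raft implementation: the class name `KeyValueServer`, the `(client_id, serial)` request identifier and the `"ok"` response are assumptions made for illustration.

```python
class KeyValueServer:
    def __init__(self):
        self.store = {}
        self.seen = {}  # (client_id, serial) -> cached response

    def write(self, client_id, serial, key, value):
        request_id = (client_id, serial)
        if request_id in self.seen:
            return self.seen[request_id]  # duplicate: reply without re-executing
        self.store[key] = value
        self.seen[request_id] = "ok"
        return "ok"
```

If client A writes x = 3 with serial number 1, client B then writes x = 5, and A retries its original request with the same serial number, the retry is recognized as a duplicate and x remains 5.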


Read operations are difficult in this case because we want to provide the property of linearizability, and also we are not writing anything to the log; we are only reading the state. So, before executing a read-only operation, the leader needs to ensure that it has committed at least a single message in the current term and that it is still the leader.


So, still being the leader is important because that is the only way it can execute a request. Committing a message in the current term means that all the messages from the previous terms have been handled: they have either been committed or they have been discarded. So, the leader in the current term has the correct state of the log. It is thus in a position to execute the read request, and the read request should always be on the committed state.

So, what do we have to do for a read, again? The current leader has to commit at least one message in the current term, confirm that it is still the leader, and then execute the read on the state machine.
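
The two conditions for serving a read can be written as a small predicate. The sketch below is an illustration under an extra assumption that the lecture does not spell out: the leader confirms that it is still the leader by getting heartbeat acknowledgements from a majority of the cluster, which is the approach described in the Raft paper. The function name and its parameters are hypothetical.

```python
def can_serve_read(current_term, committed_terms, heartbeat_acks, cluster_size):
    """committed_terms: terms of all committed entries in the leader's log.

    heartbeat_acks: number of servers (leader included) that acknowledged the
    latest heartbeat round, used here as the assumed leadership check.
    """
    committed_in_term = current_term in committed_terms
    still_leader = heartbeat_acks > cluster_size // 2
    return committed_in_term and still_leader
```

Only when both conditions hold would the leader execute the read on its state machine.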
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So, this concludes our discussion on the raft consensus protocol. So, the raft consensus protocol 
similar to Paxos does not give us consensus all the time. This is of course in line with the FLP result, which says that it is impossible to guarantee this. When does it not give us a result? When we cannot elect a leader; that is one of the cases when we are not able to reach a consensus. However, we have safety properties, in the sense that it guarantees that if something is agreed to, it is agreed to by a majority of the servers, so there is a consensus, and if there is any divergence, it can be detected and it can be fixed.
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Advanced Distributed Systems 
Smruti R. Sarangi 
Department of Computer Science & Technology 
Indian Institute of Technology, Delhi 
Lecture 14 
The Byzantine General’s Problem 


In this lecture, we will discuss Byzantine Fault Tolerance, which is also a method of 


consensus. It is known as command consensus. 


(Refer Slide Time: 00:30)

So, we will distinguish between two kinds of faults. The first kind are the normal faults that we have been seeing up till now. Normal faults can be crash faults, where the process just stops. Or it can be a crash fault with some intimation, where some entity, typically another process, gets to know that the process has suffered from a fault.


So, we have essentially been dealing with such easy-to-manage faults, where a process just stops. In some cases, other processes get to know, and in some cases they do not, as in the FLP result. But that has not significantly changed the way that we design our algorithms. What we will see now is a new kind of faults, known as Byzantine faults.

For a node with a Byzantine failure, everything is fair. Such nodes can lie, they can collude with other failed nodes, they can show up as crash failures, and they can fake or forge messages. They need not participate in an algorithm, they can deny sending messages, and they can just stop sending messages. So for them, everything is fair.


(Refer Slide Time: 02:18)

So the command consensus that we are looking for now is inspired by the classic Byzantine Generals problem, which was proposed way back in the distributed systems literature. In this case, we have a commander, and the commander issues an order. All of them are generals, so when we use the term general, it might refer to any of the lieutenant generals as well. We have three lieutenants over here.


So all of these lieutenant generals are also generals, and the commander is also a general. The commander essentially issues an order. The order is sent to all of the generals, and each general can be either loyal or disloyal. If the general is loyal, the general does not have a Byzantine fault; if the general is disloyal, then it has a Byzantine fault.


So what does it mean? Let us assume we are trying to do a binary consensus, where the generals are in different camps and they send messages to each other. A message cannot be forged in the middle. There are only two kinds of messages, attack or retreat. So let us assume that the commander sends the attack message to all of them.


So what we essentially want is for all the loyal generals to come to the same decision, whatever it is, attack or retreat. And of course, similar to regular consensus, it cannot be a dummy decision, in the sense that the loyal generals cannot always decide retreat or always decide attack regardless of the order. We will see why in the next few slides. But essentially, the idea here is that if the commander issues an order, and assuming that the commander is loyal, all the loyal generals have to obey that order.


So here is the fun part. Anybody who listens to this would say that this is not hard at all: all that we need to do is compute a majority. Let us say that there are 2k + 1 lieutenants. If k + 1 of them are loyal, then pretty much all of them would have gotten the same orders. Even if the others lie about what they have gotten, we at least have a majority of k + 1. Even though this sounds rather reasonable, the main problem is that the commander himself may not be loyal.


So, that is the main problem. Suppose the commander is not loyal. What the commander would actually do is send an attack message to the first lieutenant general, a retreat message to the second, and an attack message to the third. Then they would mutually get confused, because the commander would have sent different messages to different people. In this scenario, if we still want all the loyal generals to make the same decision, it is rather complicated.

So let us now formalize these conditions. The commander or any of the lieutenant generals can be disloyal. Here, disloyal basically means a Byzantine failure, which means that they can behave in an extremely unpredictable fashion. So nobody can trust them at all. Let us consider a synchronous algorithm. Note that this is not asynchronous, because an asynchronous algorithm cannot tolerate even a single faulty process. And in this case, we want to tolerate a lot of faulty processes.


So let us assume that there are a total of n generals. Out of that, we have one commander and n - 1 lieutenant generals; let us call them LGs. The commander sends an order to the n - 1 lieutenant generals. The commander himself may be disloyal. In this case, we want to satisfy two conditions, IC1 and IC2. IC1 is that all the loyal lieutenant generals obey the same order. This means that if the commander is over here and all of the lieutenant generals are over here, the subset of them that are loyal have to obey the same order.


They come to the same decision. And IC2 says that if the commander himself is loyal, then every loyal general obeys the order that the commander issues. Note that the important point over here is that nodes do not trust each other, processes do not trust each other. So there is no way for a lieutenant general in this case to actually get convinced that the commander is loyal. So we have to ensure IC1 and IC2 implicitly.
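The two conditions can be stated as a small checker over the outcome of one run. This is a minimal sketch; the function name `check_ic` and the way decisions and loyalty are represented are assumptions chosen for illustration.

```python
def check_ic(commander_loyal, order, decisions, loyal):
    """Return (IC1 holds, IC2 holds) for one outcome.

    decisions: dict mapping each lieutenant to the order it finally obeys.
    loyal: set of lieutenants that are loyal.
    order: the order issued by the commander (only meaningful if it is loyal).
    """
    loyal_decisions = {decisions[lt] for lt in loyal}
    # IC1: all loyal lieutenants obey the same order.
    ic1 = len(loyal_decisions) <= 1
    # IC2: if the commander is loyal, every loyal lieutenant obeys its order.
    ic2 = (not commander_loyal) or all(decisions[lt] == order for lt in loyal)
    return ic1, ic2


# Loyal commander orders attack; lieutenant 3 is a traitor and does its own thing.
good_run = check_ic(True, "attack", {1: "attack", 2: "attack", 3: "retreat"}, {1, 2})
# Traitorous commander; the two loyal lieutenants end up disagreeing.
split_run = check_ic(False, None, {1: "attack", 2: "retreat"}, {1, 2})
```

Note that the decision of a traitor never enters either condition; only the loyal generals are constrained.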


(Refer Slide Time: 07:43) 


Impossibility Results 


So now I will discuss a few impossibility results, but before that, let us take a look at two cartoon videos regarding the behavior of a loyal general and a disloyal general.

Bad General: I am a bad general. I do not trust anybody. I have a Byzantine fault. I do not even trust myself. All that I do is just lie and lie. I lie to me, to you and to everybody. It is up to you to do something, not me.


(Refer Slide Time: 08:32) 


Good General: I am a good general. I do what I am told. I never lie to anybody. However, 
unfortunately, bad generals give me wrong information. I thus need to verify this with other 


generals. It is true that I cannot trust anybody, but I never lie. 


(Refer Slide Time: 09:07)

No way to distinguish between these scenarios: General 1 chooses ATTACK (IC2). General 2 chooses RETREAT (similar argument).

Byzantine Generals Problem - II

* The commander or any of the lieutenant generals can be disloyal
* Can have Byzantine failures.
* You cannot trust them at all.

* Consider a synchronous algorithm.

* The commander sends an order to n-1 lieutenant generals.
* Condition IC1: All loyal lieutenant generals obey the same order.
* Condition IC2: If the commander is loyal, every loyal general obeys the order that the commander issues.


So now, let us consider an impossibility result for three generals. We claim that in a three-general system with 1 commander and 2 lieutenant generals, of which one general is a traitor, it is not possible to solve this problem, in the sense that it is not possible for all the loyal processes to come to a Byzantine consensus or a Byzantine agreement. So our claim is that in this case, no solution is possible.


So let us consider two scenarios, scenario I and scenario II. Let us also consider the two conditions for correctness, which are IC1 and IC2. IC1 is that all the loyal lieutenant generals obey the same order. And IC2 is that if the commander is loyal, then every loyal general obeys the order that the commander issues. In scenario I, the idea is that the commander is a traitor.


So the commander sends two messages: attack to General 1 and retreat to General 2. Now General 1 has no idea whether the commander is loyal or not. The only other source of information that it has is General 2, which sends it a retreat message, which basically means that the commander had originally sent General 2 a retreat message.


Again, General 1 has no reason for trusting General 2, but at least in this case, since we are seeing the entire scenario from a holistic point of view and are assumed to be an all-powerful observer, we can clearly see that General 2 is speaking the truth. So now General 1 is in a fix. It is getting conflicting messages from the commander and General 2, which means that one of them is disloyal. It is just that it does not know which one.


Now, consider scenario II, where General 2 is the traitor. In this case also, General 1 gets an attack message from the commander and a retreat message from General 2. So General 1 has no way of distinguishing between situation 1 and situation 2, because it does not know who is actually the traitor. So now let us apply our known conditions, IC1 and IC2.


So if you consider condition IC2, which says that if the commander is loyal, all the loyal generals need to obey the commander's order, then by this condition General 1 needs to obey attack in scenario II. We will not use the other condition right now. We will also use the observation that General 1 has no idea whether it is looking at situation 1 or situation 2. In both cases, it gets the same set of messages.


And since General 1 has to choose attack in situation 2, it has to choose attack in situation 1 as well. This part is slightly complicated, so viewers may need to go through it several times. The important point that is being made here comes from condition IC2: if the commander is loyal, all loyal generals have to obey the commander's order.


So in scenario II, the commander is loyal and the commander sends attack; hence General 1 has to obey attack. But since General 1 has no way of distinguishing between scenarios I and II, even in scenario I it needs to obey attack.

Now we have deleted all the ink from the slide. So let me quickly summarize what we just derived. From General 1's point of view, it is getting two messages: an attack message from the commander and a retreat message from General 2. This holds in both scenarios, scenario I and scenario II, so it is not possible for General 1 to differentiate between the scenarios.


So since we want a loyal commander's orders to be obeyed, General 1 needs to choose attack in both cases. This is what we were just able to prove. If we take a look at this case closely, we will see that whenever General 1 gets an order from the commander, it has to obey it. So in this case, it is obeying attack. And the reason is very simple.


The reason is that if it is getting an order from the commander, it has no way of knowing whether the commander is loyal or not. Just in case the commander is loyal, it needs to obey the order, because both of these situations, 1 and 2, look exactly identical to it. Whether General 2 is the traitor or the commander is the traitor, it is not possible for General 1 to know. Consequently, it needs to choose the order that the commander gives.


So in this case, which is scenario I, it needs to choose attack. We can use exactly the same argument for General 2. With the same logic, the same reasoning and the same steps, we can prove that in this case General 2 needs to choose retreat, because it has no other choice.


It gets retreat from the commander. It has no idea whether General 1 is the traitor or the commander is the traitor. And since it wants to fulfill condition IC2, which is that the orders of a loyal commander need to be obeyed, the only choice that it has in this case is to obey the order that comes from the commander, which is retreat. Same logic.


So if you look at scenario I once again, what we find is that General 1 obeys attack and General 2 obeys retreat. So there is clearly no consensus. There is no command consensus over here. IC1 says that all loyal lieutenant generals obey the same order. This is not happening here.


Since this is not happening here, we have derived a contradiction. The contradiction basically says that in this case, which is with three generals and one traitor, we cannot obtain Byzantine agreement. Now, let us generalize this result to a larger system.
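The three-general argument can be checked mechanically for a restricted family of protocols. The sketch below is a simplification and not the general proof: it assumes that both lieutenants use the same deterministic decision rule, and that each lieutenant decides from just two inputs, the value received from the commander and the value relayed once by the other lieutenant. The names `VALUES`, `violates` and `all_rules` are hypothetical.

```python
from itertools import product

VALUES = ("attack", "retreat")


def violates(rule):
    """True if the decision rule breaks IC1 or IC2 in some scenario.

    rule maps (value from commander, value relayed by the other lieutenant)
    to the decision of a lieutenant.
    """
    # Scenario II style: loyal commander sends v to both lieutenants; the
    # traitorous lieutenant relays an arbitrary x. IC2 demands the decision v.
    for v, x in product(VALUES, repeat=2):
        if rule[(v, x)] != v:
            return True
    # Scenario I style: traitorous commander sends a to General 1 and b to
    # General 2; both relay honestly. IC1 demands equal decisions.
    for a, b in product(VALUES, repeat=2):
        if rule[(a, b)] != rule[(b, a)]:
            return True
    return False


inputs = list(product(VALUES, repeat=2))
all_rules = [dict(zip(inputs, outputs)) for outputs in product(VALUES, repeat=4)]
failing = [rule for rule in all_rules if violates(rule)]
```

All 16 possible rules end up in `failing`: any rule that always obeys the commander (as IC2 forces) makes the two lieutenants disagree when a traitorous commander sends them different orders.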


(Refer Slide Time: 16:42)

So the general result is this: assume we have at most 3m generals, where m > 1, and there are m traitors. We want to prove that no solution is possible. The way that we will do it is that we will essentially create two categories of generals. So let us first consider the simple case, where we have 3m generals. We divide the generals into clusters.


So we just create clusters of one-third of the generals each, and to distinguish them from Byzantine generals, let us call them Albanian generals. We will see in a second why. We create three clusters, and each cluster is simulated by a Byzantine general. So we have three Byzantine generals. We take the 3m generals, divide them into three equal-sized partitions, and call the simulated generals the Albanian generals.


So let me describe the meaning of simulation. In this case, we are assuming that a protocol exists by which, with at most 3m generals and at most m traitors (or exactly m traitors, if we consider the worst case), we have a method of obtaining a Byzantine agreement. So what is this protocol? The protocol is essentially an algorithm to change the internal state of the Albanian generals, together with a message exchange.


So what we can do is that each Byzantine general can simulate the finite state machines of all the Albanian generals that are contained within its purview. We will have three such Byzantine generals, and note that these are simulated Albanian generals. So they themselves are not honest or dishonest; it all depends on the simulating Byzantine general.


So here we make the same assumption as in the previous slide, which I can show you right now, where we have three generals, one commander, two lieutenants, and one of them is a traitor. All that they do is simulate the finite state machines and message exchanges of the Albanian generals. And of course, it is possible that an Albanian general here might send a message to one over there. This would require the involvement of both the simulating Byzantine generals, this one and this one.


So what we want to prove is that if a protocol exists for Byzantine agreement with Albanian generals, then we can solve the Byzantine Generals problem with three generals and one traitor. We have seen in the previous slide that this is impossible. So we will use this fact to derive a contradiction for this problem. The notion of simulation should be clear at this point.


So now let us go further. One of these clusters will have the Albanian commander. Let the Byzantine general that simulates this cluster be the Byzantine commander. The other two clusters will just have Albanian lieutenant generals, so let the generals simulating them be the Byzantine lieutenant generals. The Byzantine commander will simulate the Albanian commander and at most m - 1 Albanian lieutenant generals.


And each of the rest of the Byzantine generals, which means these two clusters, will simulate at most m Albanian generals. Simulate means simulate their finite state machines. Since at most one Byzantine general can be a traitor, which means that one of these clusters can be a traitor, at most m of these Albanian generals can be traitors.


That is because if the simulating Byzantine general is a traitor, then the working of the finite state machines of all of its simulated generals is suspect. So if, let us say, the Byzantine general simulating the Albanian commander is a traitor, then that makes the Albanian commander a traitor as well. Using the simulation logic, it is very easy to derive a contradiction for this problem.

So let us assume that we have a solution to the Albanian Generals problem. This means that we have essentially ensured our two conditions, IC1 and IC2. So what is the assumption again? If the Byzantine general is loyal, then that implies that the Albanian generals it simulates are loyal. And if the Byzantine general is a traitor, it implies that all the Albanian generals that it is simulating are also traitors.


So now let us look at IC1. This means that all the loyal Albanian generals obey the same order. Since we are claiming that we have some protocol that ensures this, the two Byzantine generals that are simulating them will also obey the same order. So this will automatically ensure condition IC1 for the Byzantine generals as well. Let me draw the three clusters once again, and just mark the Albanian generals that they simulate.


So let us say that two of these simulating clusters are honest. Then by implication, all the Albanian generals that they simulate are also honest. So they will definitely come to an agreement. And by our assumption, the IC1 assumption, they will agree on the same value, either attack or retreat. Whatever they agree on can also be the agreement value of the simulating Byzantine generals.


So this ensures that condition IC1 holds for the Byzantine generals as well. Furthermore, if the Byzantine commander is loyal, then the Albanian commander is also loyal. What this further means is that all the loyal Albanian generals will obey this order. That is the assumption. So let us assume that the Albanian commander lies over here.


So if the commanding Byzantine general is loyal, then the Albanian commander is also loyal, which means that all the loyal Albanian generals in each of these clusters will obey the same order that is issued by the commander. Then the Byzantine generals that are simulating these clusters can also obey the same order, and that will automatically ensure condition IC2: if the commander is loyal, then all the loyal generals obey whatever the commander says.


So what have we just done? We have taken the solution for the Albanian Generals problem, let us just call it the AG problem, and we have derived a solution for the Byzantine Generals problem for three generals and one traitor. But what we know is that the latter is not solvable. And given that it is not solvable, the Albanian Generals problem is also not solvable.


So we have obtained a contradiction over here. What did we assume? We assumed that there is a solution for the Albanian Generals problem, and from it we derived a solution for the Byzantine Generals problem with three generals. Since the latter is not true, the former is also not true. Consequently, it is impossible. So how do we make sense of this result?


Well, let us say that we have 6 servers, and out of the 6 servers, 2 have a Byzantine fault. It is then not possible for the remaining 4 correctly executing servers to come to an agreement. What you actually need, if m = 2, is 3m + 1, or 7 servers. If you have 7 servers, then it is possible to arrive at an agreement.
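The arithmetic in this example can be captured by a one-line bound. The helper names `min_servers` and `agreement_possible` are hypothetical; they only encode the 3m + 1 requirement stated above.

```python
def min_servers(m):
    """Smallest number of servers that can tolerate m Byzantine faults."""
    return 3 * m + 1


def agreement_possible(n, m):
    """True if n servers are enough to tolerate m Byzantine faults."""
    return n >= min_servers(m)


six = agreement_possible(6, 2)     # the 6-server example: not enough
seven = agreement_possible(7, 2)   # 3 * 2 + 1 = 7 servers: enough
```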


(Refer Slide Time: 26:53)

So now let us present the Byzantine agreement algorithm. It is also called command consensus. That is because there is one commander, and the commander wants his command to be agreed to by all the loyal generals. And of course, if the commander himself is a traitor, then the loyal generals will at least come to an agreement among themselves.

So we have to make several assumptions. The first assumption is that every message is delivered correctly, which means that it is possible to check for message integrity. No general in the middle spoils or tampers with a message. So let us say General 1 is sending a message to General 5; the message is delivered correctly.


Furthermore, the receiver of a message knows the identity of the sender. So it is not possible for a general to fake its identity. This is a key assumption that we need to make, and in practice it is possible to ensure it using digital signatures.


Now, assumption A3 says that it is possible to detect the absence of a message. Since we have a synchronous algorithm, in a round it is possible for a failed node not to send a message, but this can be detected. So typically, the rest of the nodes can assume a default value. For example, they can assume that retreat has been sent.


The last is that we assume a majority function over the values v1 to vn that returns the majority value. Given a set of values, it returns the value that enjoys a majority in the set. And clearly, if a majority does not exist, then we return a default value, which can be retreat. So the majority function is important, and we will use it.
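A minimal sketch of such a majority function is shown below. It assumes that "majority" means a strict majority (more than half of the values) and that the default is retreat; the names `majority`, `ATTACK` and `RETREAT` are chosen for illustration.

```python
from collections import Counter

ATTACK, RETREAT = "attack", "retreat"


def majority(values, default=RETREAT):
    """Return the value held by a strict majority of `values`, else the default."""
    if not values:
        return default
    value, count = Counter(values).most_common(1)[0]
    # A strict majority means more than half of all the values.
    return value if 2 * count > len(values) else default


clear = majority([ATTACK, ATTACK, RETREAT])   # attack has 2 of 3
tied = majority([ATTACK, RETREAT])            # no majority, falls back to retreat
```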


(Refer Slide Time: 29:02)

So let us look at the steps of this algorithm. This algorithm is actually a recursive algorithm, so we need to first define the base case, which is OM(0). OM is the name of the agreement algorithm. The commander sends its value to each lieutenant; that is the first step. Each lieutenant then accepts the value, or accepts retreat if no message was sent.


So what is 0? 0 is the number of traitors. If there are no traitors, then of course it is assumed that the commander is loyal. So each lieutenant will simply accept the value that the commander sends, and just in case no message was sent, it accepts retreat. In any case, as we can see, conditions IC1 and IC2 hold, mainly because there are no traitors.

Now, let us consider OM(m), where m, the number of traitors, is greater than 0. We have a total of n generals, and that includes the commander. So let us say these are all the lieutenants.


So the first step is the same: the commander sends his value to every lieutenant. But of course, the lieutenants can get different values. So let lieutenant i receive value vi from the commander. Well, I should have given this a different name; let us say this is lieutenant Gi, and it gets value vi from the commander.


So now, from lieutenant i's point of view, it has no idea whether the commander is loyal or not. The only thing that it can actually do is check with the others. How does it know that the same value was sent to the other lieutenants? Well, it does not know immediately, but after some message exchanges, it will get to know. So it starts a recursive algorithm with a smaller argument.


So it starts an algorithm OM(m - 1). This is very crucial. When a message comes to a certain lieutenant, it starts a recursive algorithm with a reduced subset, which is essentially the set of lieutenants, without the commander. So what is the aim of starting this?

The aim of starting this is basically to verify and understand what it is that they have gotten, and to try to arrive at some kind of an agreement, at least between the loyal lieutenants. This is very similar to an office situation where we have one of those mischievous bosses who tells different things to different people.


So the only way out for the people is to just ask each other what it is that the boss told them. The situation here is extremely similar: we have the commander sending messages to different lieutenants. Once the lieutenants get the messages, the only thing that they can do is take the commander out of the picture and run an algorithm among themselves to understand what it is that they have gotten and what it is that they should agree upon.


So the loyal generals in OM(m - 1) need to come to an agreement about the value that has been received by lieutenant i. So what is the idea here? The idea is very simple: if we had a total of n generals at the beginning, we create a smaller set of n - 1 generals. There, the ith lieutenant broadcasts the value that it got from the commander, which is vi, to the rest of the nodes.


So what is it trying to do over here? Well, it is trying to initiate another Byzantine consensus, where it is trying to convince everybody that it has actually gotten vi from the commander. It is running a Byzantine algorithm precisely to do that. So after OM(m - 1), what is the final state? The final state is that the rest of the lieutenants in this set come to an agreement about the value that lieutenant i has actually gotten from the commander. (We will use general and lieutenant interchangeably.) This is again a Byzantine problem, because it is possible that different nodes in this set might be dishonest. It is possible that lieutenant i might be dishonest.


So different combinations are possible. It is again a smaller instance of the same problem, but after that, at least all the loyal generals in this set will mutually agree on what it is that lieutenant i actually got. If lieutenant i is honest, then that value has to be vi. If lieutenant i is a traitor, then it may be some other value, but at least all the loyal generals in this set will agree on that value.


So it does not matter whether lieutenant i himself is loyal or not. That is not important. The rest of the loyal generals simply have to agree on the value that i got from the commander after OM(m - 1) finishes. And this is where they have to follow the IC1 and IC2 conditions.


IC1 is what is just written: all the loyal generals come to an agreement. And IC2, if I were to go back a little bit, is essentially that if the commander is loyal, every loyal general obeys the order. In this case, that would basically mean that if lieutenant i is loyal, every loyal general in this set agrees on the value vi.


So what is the big picture? The overall picture is that OM(m) starts with the commander broadcasting its value to a set of lieutenants. If there are a total of n generals in the beginning, we have (n - 1) lieutenants, and all (n - 1) get values from the commander. If the commander is dishonest, it sends different things.


Now, they need to chat and gossip among themselves to figure out what exactly it is that the commander has sent. So each one of them starts an instance of OM(m - 1). How many instances in total? (n - 1) instances in total, and in each instance, the ith lieutenant broadcasts the value vi that it got from the commander at the beginning of OM(m).


And the rest of the loyal generals have to agree on this value. What are they agreeing on? They are agreeing on what the ith general has actually gotten. And the agreement has to follow the IC1 and IC2 correctness conditions. So OM(m) essentially ends up making (n - 1) calls to OM(m - 1). Similarly, each OM(m - 1) ends up making (n - 2) calls to OM(m - 2). So we have pretty much a factorial-like complexity over here.
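The growth in the number of messages can be made concrete with a small recurrence. This is a sketch under the assumption that an instance with n generals costs (n - 1) messages for the commander's broadcast, plus one OM(m - 1) instance over (n - 1) generals for every lieutenant; the function name `messages` is hypothetical.

```python
def messages(n, m):
    """Messages sent by OM(m) with n generals (commander included)."""
    broadcast = n - 1                 # commander sends to every lieutenant
    if m == 0:
        return broadcast
    # Every lieutenant acts as the commander of one OM(m - 1) instance
    # over the remaining n - 1 generals.
    return broadcast + (n - 1) * messages(n - 1, m - 1)


small = messages(4, 1)    # 3 + 3 * 2
larger = messages(7, 2)   # 6 + 6 * (5 + 5 * 4)
```

For 4 generals and 1 traitor this gives 9 messages; for 7 generals and 2 traitors it already gives 156.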
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So a little visualization over here that the commander at the beginning broadcasts the 
value to everybody. So let us say that a general does not accept the value that the commander sends. It simply does not accept. Instead, it asks the rest of the lieutenants and comes to an agreement regarding the value sent by the commander. Essentially, it is written in an imperative style: you do not accept the value, and you actually ask the rest of the lieutenants what it is that they have gotten.


And so, for that reason, it is necessary to broadcast. But again, as we have seen, a simple broadcast is useless, because general n itself can be dishonest. So this again has to be a smaller instance of the original algorithm, which is OM(m - 1). Of course, in this case, the commander is not included. At the end of this, if general n had gotten the value vn, everybody would come to a Byzantine agreement regarding the value that general n got.

Finally, step 2. After the OM(m - 1) step, each general has received a total of (n - 1) values. How is that? The commander sends its value to the (n - 1) lieutenants, and each one of them starts an instance of OM(m - 1). So, if I were to consider any node, any process, any general, it would get one value from the commander, and it would get (n - 2) values from the rest of the lieutenants, making it a total of (n - 1) values.


This is exactly what is written here: from (n - 2) lieutenant generals, it gets (n - 2) values. Of course, by values I am not referring to the value that was broadcast. I am referring to the value that was agreed upon after running OM(m - 1), that is, after these OM(m - 1) algorithms have concluded. So, if let us say there are four nodes, it will be clear about the values v1, v2 and v3.


It will clearly know what these values are. And of course some of these nodes could be
dishonest, but then, as we have seen, there will still be an agreement among the loyal
generals regarding the value such a node got, which incidentally might not be the value it
actually got, but at least there will be an agreement. So after this, we have a total of
(n - 1) values with each of these lieutenants.


So why (n - 1)? One it got from its commander at the beginning of OM(m), and (n - 2) it got
after running the individual instances of the reduced Byzantine problem. So basically, if I
were to consider the OM(m) algorithm, initially we have a broadcast stage where the
commander just broadcasts. Then, we run (n - 1) instances of OM(m - 1).


In these instances, all of these values are agreed upon. And after that, for each
lieutenant i, we have a set of (n - 2) values plus 1, the one being what was received from
the commander. So we compute the majority function on this set. And whatever we get is used
as the output of the algorithm OM(m).


So for each i, what is it that we do? At the beginning, the commander broadcasts its value
to each of the lieutenants. Each lieutenant initiates a round of OM(m - 1). Fair enough. It
tries to obtain an agreement over the value it got from the commander. After that has
happened, we have run (n - 1) instances of OM(m - 1), which means that, for each
lieutenant, we end up agreeing upon (n - 1) values.


For each lieutenant, we agree upon these many values. Out of this, of course, one value the
lieutenant would have received from the commander in the broadcast stage, and the remaining
(n - 2) would be received because of the other instances of the reduced problem. So once
it has these, it can then compute the majority of the set.


Let us say the output of that majority is v. So v is pretty much the output of the OM(m)
algorithm, and this v is stored at each of the lieutenants. So this is clearly a recursive
algorithm with a very factorial-like feel: initially we run (n - 1) instances, and then
(n - 2) instances for each of those, then (n - 3) for each, and so on. So clearly the
message complexity is high.


The time complexity is a different matter. It would be incorrect to flatly say that it is
not that high; it all depends upon how many messages and how many computations we can fit
into a single round. But what is important is that the number of instances of these
algorithms and the sheer number of messages sent clearly go up as a factorial function.
So the crux of this algorithm is that the generals do not trust their commander. Hence they
are not willing to accept any order at face value. For every order, there is a need to
consult the others and then come to an agreement. And because of this consultation, we have
to run smaller instances of the same Byzantine agreement algorithm.


And we expect that the majority function that we compute, which is essentially this step 
over here, has some favorable properties. And what are the favorable properties? So I am 
using the American spelling here and ignoring the u. So the favorable properties are that 


IC1 and IC2, the two correctness conditions that we have, both of them should be satisfied. 
So let us give a simple example, because the algorithm, even though it is simple to state,
is still reasonably complicated. General 3 is the traitor in this case. So what happens? At
the beginning, we are going to run OM(1), which means the commander sends its value v to
everybody. After that, each one of them runs an instance of OM(0), which is as simple as
just sending the value.


So I am not showing all the messages; I am only showing those messages that are relevant
for General 2. General 2 gets the messages v, v and x, so as you can see, it would compute
the majority of v, v and x. Here x can be anything; x is a placeholder for any message,
because it is coming from a traitor node.
So the majority of v, v and x is clearly v, which means that General 2 is going to agree on
v. We can do something very similar for General 1; here also, we will find it will agree on
v. And of course, the commander, since it is loyal, also holds v. So the Byzantine
agreement conditions are satisfied here.
Now, consider a disloyal commander. Let us assume that in the first round it sends messages
x, y, and z to Generals 1, 2 and 3. Then in the second round, when each of them runs the
reduced instance of the Byzantine problem, General 1 will send x to General 2 and also to
General 3. General 2 will send y to Generals 1 and 3, and General 3 will send z to
Generals 1 and 2.


So, if you actually look at it, each of these generals receives the same set of messages:
x, y, and z. All of them compute the value of the majority function, and the majority
function does not depend upon the order of the inputs. So all of them will essentially
compute the same value, which is majority(x, y, z).


So in this case, of course, condition IC2 is not relevant because the commander is not
loyal. But at least all the loyal lieutenants, who are Generals 1, 2, and 3, are going to
agree on the same value, which is the majority of x, y, and z. Whatever it is, we do not
care, because it will be the same for all of them.
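
The two four-general examples can be replayed with a small simulation. This is a minimal sketch under stated assumptions, not the lecture's own code: orders are the strings "ATTACK" and "RETREAT", the default when there is no strict majority is "RETREAT", and a traitor misbehaves in one fixed way (it picks what to send from the parity of the receiver's ID). The names `om`, `send` and `majority` are hypothetical.

```python
from collections import Counter

DEFAULT = "RETREAT"  # assumed default order when no strict majority exists


def majority(values):
    """Return the strict-majority value, or DEFAULT if there is none."""
    value, count = Counter(values).most_common(1)[0]
    return value if 2 * count > len(values) else DEFAULT


def send(sender, receiver, value, traitors):
    """Loyal senders relay faithfully; a traitor lies based on receiver parity."""
    if sender in traitors:
        return "ATTACK" if receiver % 2 == 0 else "RETREAT"
    return value


def om(m, commander, lieutenants, value, traitors):
    """Run OM(m); return {lieutenant: value it settles on for this commander}."""
    received = {i: send(commander, i, value, traitors) for i in lieutenants}
    if m == 0:
        return received
    # relayed[j][i] is what lieutenant i concludes lieutenant j received.
    relayed = {
        j: om(m - 1, j, [i for i in lieutenants if i != j], received[j], traitors)
        for j in lieutenants
    }
    return {
        i: majority([received[i]] + [relayed[j][i] for j in lieutenants if j != i])
        for i in lieutenants
    }


if __name__ == "__main__":
    # Loyal commander 0, traitor General 3: Generals 1 and 2 both settle on ATTACK.
    print(om(1, 0, [1, 2, 3], "ATTACK", traitors={3}))
    # Traitorous commander: the loyal lieutenants still agree with each other.
    print(om(1, 0, [1, 2, 3], "ATTACK", traitors={0}))
```

With only three generals and one traitor, the same code makes a loyal lieutenant fall back to the default even though the loyal commander said "ATTACK", which is a small illustration of why more than 3m generals are needed.
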
Proof


So now that we have seen the algorithm, it is time to move on to the proof. The proof is
rather elegant, and it is important to appreciate it because the algorithm was pretty
complicated. Unless its proof is also understood, it will be hard to retain the key
concepts of the algorithm.
So this is kind of a loaded slide, so we will go slow. We will first prove a lemma. It holds
for any m and k, where m is the round, that is, the parameter of OM(m). Of course, m is not
the number of traitors here, so let us not confuse ourselves with that. So for any m and k,
OM(m) satisfies IC2 if there are more than 2k + m generals and at most k traitors. Recall
that IC2 says: if the commander is loyal, all loyal generals obey his order.


So if we have k traitors and more than 2k + m generals, then we can say that OM(m) will
satisfy IC2. Let n be the number of generals, with n > 2k + m, where k is the number of
traitors and m is the round number. The assumption here is that the commander is loyal,
because if the commander is not loyal, then IC2 is not relevant.


So it broadcasts value v to the rest of the nodes. If m = 0, this is trivially satisfied
because all the nodes simply accept what the commander sent. Assume it is true for m - 1.
Let us try to prove that it is true for m. So we are essentially using induction. The base
case is satisfied. Now we want to prove the induction case.


So we are assuming it is correct till m - 1. Step 1 is that the commander sends a value to
the (n - 1) lieutenants. This step is correct. Now let us consider step 2. Each loyal
lieutenant applies OM(m - 1) with (n - 1) generals. So we have the simple math over here
that if n > 2k + m, which is our assumption, then (n - 1) > 2k + (m - 1).


So exactly the same assumption that we have made over here can be made for OM(m - 1): the
number of generals for OM(m - 1) is n - 1, which is > 2k + (m - 1). Since the basic
assumption is in place, by the induction hypothesis OM(m - 1) runs correctly.


So in OM(m - 1), if one of the loyal lieutenants acts as a commander, all other loyal
lieutenants will also agree on the value v. The original commander is loyal, so it sends
value v to all the loyal lieutenants. And if one of them becomes a commander in OM(m - 1),
then given the induction hypothesis, all other loyal lieutenants will agree on v.


Now, here is the catch. We assumed that in this case m >= 1. So if we put m >= 1 into the
assumption n > 2k + m, what we get is (n - 1) > 2k + (m - 1) >= 2k, that is, (n - 1) > 2k.
This result follows directly from the assumption.


This means that a majority of the lieutenants are actually loyal. Why? Because we have at
most k traitors, and since (n - 1) > 2k, (n - 1 - k) is > k. So a majority of the
lieutenants in this set are loyal. And a loyal lieutenant i receives v from every loyal
lieutenant j in OM(m - 1), which is our induction assumption.


Out of the (n — 1) generals participating in each instance of OM (m — 1), a majority are 
going to receive v. Hence all the loyal generals will decide on v. So it is important that I 


elaborate on this part a little bit more. 
So what was the assumption over here? Our assumption was that n > 2k + m for the algorithm
OM(m). Now, if I consider the reduced algorithm, n - 1 > 2k + (m - 1), so this means that
essentially in a lower round, the same hypothesis also holds.


And by our induction assumption, we assume that OM(m - 1) works correctly. Now look at this
expression over here: if m >= 1, that implies m - 1 >= 0. So from n - 1 > 2k + (m - 1), we
can then say that n - 1 > 2k.


That further implies the following. We have a total of n - 1 lieutenants. Out of that, we
have at most k traitors, so we have at least n - 1 - k loyal generals. The k traitors we do
not really care about, but the n - 1 - k loyal generals are the ones that we really care
about.


So in this case, what we have shown is that the number of loyal generals is greater than
the number of traitors: n - 1 - k is the number of loyal generals, and k is the number of
traitors. So now let us see what this gives us.
Lemma 1


For any m and k, OM(m) satisfies IC2 if there are more than 2k + m
generals and at most k traitors. Let n = #generals. n > 2k + m

Proof


* Assumption: commander is loyal. It broadcasts value v.
* For (m = 0), this is trivially satisfied.
* Assume it is true for (m-1), prove it is true for m (use induction)
* Step 1: Commander sends a value to (n-1) lieutenants
* Step 2: Each loyal lieutenant applies OM(m-1) with (n-1) generals
* n > 2k + m implies (n-1) > 2k + (m-1) [OM(m-1) runs correctly]
* In OM(m-1) if a loyal lieutenant is a commander all other loyal lieutenants agree on v
* Since (m >= 1), (n-1) > 2k. Thus a majority of lieutenants are loyal
* A loyal lieutenant i receives v from every loyal lieutenant j in OM(m-1) [assumption]
* Out of the (n-1) generals participating in each instance of OM(m-1), majority receive v
* Thus all loyal generals decide v (IC2)


Since the commander is loyal, what the commander essentially did is it sent its value v to
everybody. Out of the recipients, let us assume these are the loyal generals, and let us
say this one is a traitor. I am not showing all the generals. So the loyal ones are the
ones who will again take the value v and broadcast it in OM(m - 1).


So, one thing that is clear is that all the loyal generals are going to send v. And between
any two loyal generals, they will also receive v. So if I were to consider a loyal general
over here, it would receive at least n - 1 - k copies of v, and the remaining values it
will get from traitors, which can be any x and which we do not care about.
But since the first number is greater than the second, any majority function of these
values is going to return the value v, which is essentially what we care about: every loyal
general would obtain this, and thus they would all decide on v. And this is exactly IC2,
which proves the statement of the lemma.
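
The chain of inequalities behind the lemma can be written out in one place. This is only a restatement of the counting argument in the proof, using the same symbols n, k and m; nothing new is assumed.

```latex
\begin{align*}
n > 2k + m \;&\Longrightarrow\; n - 1 > 2k + (m - 1)
  && \text{(the hypothesis carries over to } \mathrm{OM}(m-1)\text{)} \\
m \ge 1 \;&\Longrightarrow\; n - 1 > 2k + (m - 1) \ge 2k \\
n - 1 > 2k \;&\Longrightarrow\; (n - 1) - k > k
  && \text{(loyal lieutenants outnumber the traitors)}
\end{align*}
```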


So what was the statement of the lemma, once again? It was that for any m and k, OM(m)
satisfies IC2 if there are more than 2k + m generals and at most k traitors. This we have
been able to prove via induction. I would request the viewers to go over this proof once or
twice till it becomes reasonably clear.


(Refer Slide Time: 58:25) 


Theorem 1


With at most m traitors, and with more than 3m generals, OM(m)
satisfies conditions IC1 and IC2.


Proof
* If (m = 0), this is trivially true
* Assume it is true for OM(m-1), prove for OM(m)
* Assume the commander is loyal
* Take (k = m), and use Lemma 1 to prove that OM(m) satisfies IC2
* It trivially satisfies IC1
* Commander is not loyal
* At most (m-1) lieutenants are traitors
* OM(m-1) runs on more than (3m-1) generals
* 3m-1 > 3(m-1). Induction hypothesis can be applied to OM(m-1)
* All instances of OM(m-1) thus satisfy both IC1 and IC2




Given that we have proven this key lemma, we can now prove the theorem. The theorem says
that with at most m traitors and with more than 3m generals, which means at least 3m + 1
generals, OM(m), the Byzantine agreement algorithm, satisfies the conditions IC1 and IC2.
Well, if m = 0, this is trivially true, as we have just seen, since there are no traitors.


So assume it is true for OM(m - 1); again, this is a proof by induction, with the base case
being proven here. And let us try to prove it for OM(m). So assume that the commander is
loyal. In this case, in the previous lemma, I take k = m and I use Lemma 1. With k = m, the
condition n > 2k + m becomes n > 3m.


And with at most m traitors, we use this to prove that it satisfies IC2. And if it
satisfies IC2, which means that all the loyal generals obey what the commander is saying,
and the commander is loyal, then it trivially satisfies IC1. So because of the previous
lemma, proving this case where the commander is loyal actually became trivial.


Now, let us consider the case where the commander is not loyal. Again, several simple
points. At most m - 1 lieutenants are traitors. That is because there are at most m
traitors, and one of them is the commander, so at most m - 1 lieutenants are traitors. So
OM(m - 1) runs on more than 3m - 1 generals.
And 3m - 1 > 3(m - 1). So the same assumption we are making at the top also holds for
OM(m - 1). And if the assumption holds, then the basic conditions for OM(m - 1) are there.
So the induction hypothesis can be applied to the reduced Byzantine problem.


And the induction hypothesis will then basically say that all instances of OM (m — 1) will 
satisfy IC1 and IC2. So let us make this assumption and try to prove the induction case 


where OM (m) also satisfies IC1 and IC2. 
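
The counting used in the two cases of the theorem can be summarized as follows. This is a compact restatement of the inequalities from the proof, with n denoting the number of generals taking part in OM(m).

```latex
\begin{align*}
\text{Loyal commander } (k = m):\quad & n > 3m = 2k + m
  \;\Longrightarrow\; \text{Lemma 1 gives IC2, and hence IC1.} \\
\text{Traitorous commander}:\quad & n > 3m
  \;\Longrightarrow\; n - 1 > 3m - 1 > 3(m - 1), \\
 & \text{with at most } m - 1 \text{ traitors among the } n - 1 \text{ lieutenants.}
\end{align*}
```
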
So what this means is that in Step 3, any two loyal lieutenants get the same values for
each instance of OM(m - 1). If I were to explain the Byzantine algorithm to anybody, I
would say that this is by far the most crucial step, which in my view should be visualized
like this. Say these are all the lieutenants.


And if I am assuming that OM(m - 1) runs correctly, then what happens is that in the second
step we have (n - 1) instances of this algorithm. After each instance ends, its output is
not explicitly sent to any given Gi. It is not explicitly sent; rather, as we have seen, it
is inferred, but everybody infers the correct thing.


So now, I am assuming that OM(m - 1), by the induction hypothesis, runs correctly. Gi was a
party to (n - 2) instances of this algorithm initiated by its fellow lieutenants, and it
also received one value from its commander, which it used to seed its own instance of
OM(m - 1). So if I add that up, it gets (n - 1) values in all. Assuming that Gi is loyal
and another Gj is loyal, we claim that they get the same set of values.


This is very important, and this is the crux of the entire proof. The question is, why
would they not get the same set of values? So let us try to understand the circumstances
which might lead us to think that they will not get the same set of values.


So there is a need to elaborate on this. Let us maybe draw another diagram over here, and
let us only consider two of these loyal generals, Gi and Gj, with a few more in the middle.
Each one of the generals in the middle initiates a round of OM(m - 1). And since we assume
that these run correctly, by the IC1 assumption, whatever value another general Gk got,
call it vk, both Gi and Gj are going to agree on this value of vk.


So then, let us consider the value vi that Gi got. Gi will be the commander of its own
instance of OM(m - 1), so it will send the value to Gj and initiate a round of OM(m - 1).
Because of IC2, Gj is also going to agree that Gi actually got vi, and the same will hold
for Gj. So this essentially means that for each of these individual sub-algorithms, that
is, algorithms with a reduced input, all the loyal lieutenants will end up with the same
set of values, the same v1 to v(n-1).


It is going to be the same set. There is no reason why it should be a different set. And
the reason is very simple: each of these OM(m - 1) instances, if it is running correctly,
will correctly produce a Byzantine agreement. So whatever it is that a loyal general has
gotten from the commander will be reflected in the sets of the other loyal generals.


And it is possible that one of them is a traitor, but that does not matter, because all the
loyal generals will come to an agreement about what the traitor must have gotten. It does
not matter if the traitor got that or not. As long as the rest of the loyal generals agree,
IC1 is satisfied, which is anyway what we care about.
Given that the vector of values is the same for all the loyal lieutenants, the majority of
this vector, whatever it is, will be the same for all the loyal lieutenants. And that is
the key, that is the crux. If you go back to our example, something very similar was
happening.


So, essentially, the majority of the same set or vector will be the same for everybody, for
all the loyal lieutenants. They are going to compute the same majority value, and hence we
have an agreement. So with m traitors, what this means is we need at least 3m + 1 generals.


And then we showed an algorithm; we showed that it is possible to get Byzantine agreement.
This particular algorithm that we showed is actually very, very expensive in terms of
computation and resources. It is essentially a recursive algorithm whose complexity is a
factorial complexity. So it is clearly more than exponential.
Solution with Signed Messages


And so that is the reason why, typically, in most instances of Byzantine consensus, we make
some simplifying assumptions. So let us make one such assumption, which leads to a solution
with signed messages.


(Refer Slide Time: 67:43) 


New assumptions


* Let us add the notion of signed messages
* This will be useful when we discuss cryptocurrencies
* Assumption A4
* A loyal general's signature cannot be forged. Any alteration can be detected.
* Anybody can verify the authenticity of a general's signature.
* If we have m traitors, we can have a solution for any number of generals
* Let the commander be General 0.
* The notation x:i means the value x is signed by general i
* The notation x:i:j means that first i signed the message and then j
* Signatures are always appended


So here, what we will do is create an additional assumption, and we will find that such
Byzantine solutions with signed messages are common in the world of cryptocurrencies. We
will have a separate lecture for that. Here we add an additional assumption, which is A4.
We claim that this will make our life significantly easier.


Here, our claim is that a loyal general's signature cannot be forged. It is not possible to
forge the signature of a loyal general, and we can detect any kind of alteration. What this
furthermore means is that if, let us say, the commander is sending a value v, the value v
cannot be tampered with.


So let us say the commander is sending value v and signing it. We will use the colon
notation: the left side of the colon is a value, and the right side is the signature of
whoever is signing it. Let the commander be general number 0. So if there is a message of
the type v:0, there is no way that this message can be forged or tampered with.


So if this is what the commander has said, well, this is what it has said. We will use the
notation x:i to mean that x is the message and it has been signed by General i. Something
of the form x:i:j would mean that the contents of the message are x, and it was first
signed by i and then by j. And of course, this particular sequence is not tamperable. It is
not possible to tamper with it.


All that we can do is append signatures. We will see why it is done this way when we
discuss more about cryptocurrencies and so on. But at the moment, the contents of the
message and also the existing set of signatures cannot be changed. The only thing that a
node can do is append to this list. That is all.
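
The append-only signature chain can be sketched in a few lines. Note the assumptions: the per-general keys in `KEYS` are made up, and HMAC from the Python standard library is used only as a stand-in for a signature. HMAC needs the secret key for verification, so it does not give the "anybody can verify" part of A4; a real system would use public-key signatures. The class name `SignedOrder` is hypothetical.

```python
import hashlib
import hmac
from dataclasses import dataclass

# Hypothetical secret keys, one per general. HMAC is only a stand-in: real
# signatures would be public-key based so that anybody can verify them.
KEYS = {i: f"secret-of-general-{i}".encode() for i in range(4)}


def _tag(signer: int, payload: bytes) -> bytes:
    return hmac.new(KEYS[signer], payload, hashlib.sha256).digest()


@dataclass(frozen=True)
class SignedOrder:
    value: str
    chain: tuple = ()  # ((signer, tag), ...) in signing order

    def _payload(self, upto: int) -> bytes:
        """Bytes covered by signature number `upto`: the value and all earlier signatures."""
        parts = [self.value.encode()]
        parts += [str(signer).encode() + tag for signer, tag in self.chain[:upto]]
        return b"|".join(parts)

    def sign(self, signer: int) -> "SignedOrder":
        """Return a new order with `signer` appended; nothing existing is modified."""
        tag = _tag(signer, self._payload(len(self.chain)))
        return SignedOrder(self.value, self.chain + ((signer, tag),))

    def signers(self) -> tuple:
        return tuple(signer for signer, _ in self.chain)

    def verify(self) -> bool:
        """Check every signature against the value and the signatures before it."""
        return all(
            hmac.compare_digest(tag, _tag(signer, self._payload(k)))
            for k, (signer, tag) in enumerate(self.chain)
        )


if __name__ == "__main__":
    order = SignedOrder("ATTACK").sign(0).sign(2)  # the message written ATTACK:0:2
    print(order.signers(), order.verify())
    print(SignedOrder("RETREAT", order.chain).verify())  # altered value is detected
```

Because each signature covers the value and every earlier signature, changing the value or reordering the chain makes verification fail, while appending a new signature leaves the earlier ones valid.
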


So we will find that if we have such a facility available, then it is possible to have a
significantly simpler solution for this problem of the Byzantine generals. That is mainly
because this non-tamperability will give us the ability to significantly simplify the way
we do things. And we do not have to run this factorial-style computation.
Assumptions - II


* Let the set Vi be the set of properly signed orders received by process i
* Initially Vi = ∅. Assume a maximum of m traitors.
* Assume a function: choice(V). V is a set of values.
* If V contains a single element v: choice(V) = v
* choice(∅) = <default_value>
* choice(V) is fixed for a given V. It does not depend upon the order
  of elements within V.


So, a few more assumptions. Let Vi be the set of properly signed orders. A properly signed
order is essentially the value of the order plus a sequence of signatures, where this
entire message has not been tampered with. Then it is a properly signed order.


So initially, for each general, Vi = ∅; it is empty. Additionally, assume a maximum of m
traitors, which is the standard assumption that we have been making up till now. Assume a
function choice(V), where V is a set of values, and choice(V) has certain properties. Let
us see what they are. If V contains a single element, choice(V) is that single element,
small v.


If the set capital V is empty, then choice(V) is a default value. Otherwise, choice(V) is
fixed for a given set of values, and it does not depend upon the order of the elements
within V. So, similar to the majority function, for a given set the choice is fixed.
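
One possible choice function with these properties is sketched below. The default order "RETREAT" and the rule of taking the median of the sorted values are assumptions made for illustration; any fixed, order-independent rule would do equally well.

```python
DEFAULT_ORDER = "RETREAT"  # assumed default value for the empty set


def choice(orders):
    """Pick one order from a set, independent of the order of its elements."""
    if not orders:
        return DEFAULT_ORDER
    if len(orders) == 1:
        return next(iter(orders))
    # Any fixed rule works here; taking the median of the sorted values is an assumption.
    ranked = sorted(orders)
    return ranked[len(ranked) // 2]


if __name__ == "__main__":
    print(choice(set()))
    print(choice({"ATTACK"}))
    print(choice({"ATTACK", "RETREAT"}))
```
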
Algorithm


* [0] The commander signs an order and sends it to every lieutenant
* For each lieutenant i
* If he receives a message of the form v:0 and has not received any order
* [1] Sends the message v:0:i to every other lieutenant
* Lieutenant i receives a message of the form v:0:j1:...:jk and v ∉ Vi
* [2a] Add v to Vi
* [2b] If (k < m), he sends v:0:j1:...:jk:i to every lieutenant other than (j1, ..., jk)
* [2c] If lieutenant i is not sending a message, it sends a message to
  everybody telling them that it will not send a message. This can be
  detected via a timeout as well.
* When lieutenant i will get no more messages, it chooses choice(Vi)


Now, the algorithm goes as follows. It is a very simple algorithm, substantially simpler
than what we have seen, and the proof is also substantially simpler. The commander signs an
order and sends it to every lieutenant. Here also, we make the same assumptions about
loyalty: the commander can be disloyal, it does not matter. It essentially takes a value
and signs it. The commander is General 0, and it sends the signed value to every
lieutenant.


For each lieutenant i, if he receives a message of the form v:0, which means directly from
the commander, and has not yet received any order, then lieutenant i sets Vi to contain the
value that it received. And then, step 1, it sends the message v:0:i to every other
lieutenant. So what happens is that for the first message that any lieutenant gets in this
synchronous algorithm, it adds the value v to its set, appends its own signature to get
v:0:i, and sends that to every other lieutenant.


So now, let us consider the steady state case. Lieutenant i receives a message consisting
of the value v, initially signed by the commander, followed by a sequence of signatures
from j1 to jk, and the value v is not present in its own set. Well, how can this happen?
This can happen if the commander is not loyal; then the commander will be sending different
values to different nodes. All of them will carry its own signature, which is fine.
So then the nodes will be exchanging these messages among each other, which is similar to
what we were doing in the previous algorithm. The only difference is that in this case,
these are signed messages. The advantage of signed messages is that it is possible to know
exactly who has received what, and if the lieutenants just exchange these signed messages
between themselves, they will get an idea of every single message that the commander has
sent.


So, it is important that the lieutenants cannot create any new values. This was not the
case in the previous algorithm, but here every message has to be initially signed by the
commander, and neither that signature nor the value it covers can be forged. So any value
that is there in the system ultimately has to have been generated by the commander, because
nobody else can generate one.


So, the task of this algorithm is just to collect all the values that the commander has
generated and sent, just in case it is disloyal, and to ensure that all of those values are
there with all of the loyal generals. If every loyal general has all the values that the
commander generated and sent in the first round, then they apply the choice function to
that set. And as long as the set of values is the same, the output of the choice function
is also the same.


So, what is the main aim of this algorithm? It is just to circulate all the values that the
commander has generated and ensure that they reach all the loyal lieutenants. Steps 2a and
2b are vital in doing that. If a message of the form v:0:j1:...:jk is received, and v is
not there in the set Vi, so it is one of those values that a traitorous commander has
generated, then we add v to Vi.


And if k < m, where k is the number of signatories of this message other than the commander
itself (please do not confuse it with the k we used in the previous algorithm), then
lieutenant i appends its own signature, which essentially becomes j(k+1), to the message,
and sends it to every other lieutenant that is not in the original set j1 to jk.
And of course, it does not send a message to itself. The message basically says that a new
value has been discovered. In any round, if a certain lieutenant does not send a message,
it should send a message to everybody telling them that it will not send one;
alternatively, this can be detected via a timeout, and a default value can be inferred.


When lieutenant i will get no more messages, it chooses choice(Vi). And as we have just
discussed, as long as every single value that the commander generated and sent is there
with all of the loyal lieutenants, they all have the same set of unique values.


And the moment they compute the choice function on it, the output is guaranteed to be the
same, and this is essentially the Byzantine agreed value. So the aim of this algorithm is
just to circulate all the values that the commander has sent among all the nodes. That is
the key aim.
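
The relay steps can be tried out in a small synchronous simulation. This is a sketch under stated assumptions rather than the lecture's own code: signatures are modelled as a plain tuple of signer IDs (unforgeability is assumed, not implemented), a traitorous commander is described by a hypothetical `commander_plan`, and a traitorous lieutenant never invents values but only forwards what it holds to the recipients listed in `traitor_targets`. The `choice` rule is an assumption as well.

```python
DEFAULT_ORDER = "RETREAT"  # assumed default when no order was received


def choice(values):
    """Deterministic, order-independent pick (the median rule is an assumption)."""
    return sorted(values)[len(values) // 2] if values else DEFAULT_ORDER


def sm(n, m, order, traitors=frozenset(), commander_plan=None, traitor_targets=None):
    """Simulate the signed-message algorithm; return {loyal lieutenant: chosen order}.

    commander_plan: {lieutenant: value} sent by a traitorous commander.
    traitor_targets: {traitor lieutenant: recipients it bothers to relay to}.
    """
    lieutenants = range(1, n)
    traitor_targets = traitor_targets or {}
    V = {i: set() for i in lieutenants}
    # Step [0]: each message is (recipient, value, signers); signers[0] is always 0.
    if 0 in traitors:
        inbox = [(i, v, (0,)) for i, v in (commander_plan or {}).items()]
    else:
        inbox = [(i, order, (0,)) for i in lieutenants]
    for _ in range(m + 1):  # the commander's round plus m relay rounds
        outbox = []
        for i, value, signers in inbox:
            if value in V[i]:
                continue  # nothing new, so nothing to relay
            V[i].add(value)  # step [2a]
            k = len(signers) - 1  # lieutenant signatures already on the message
            if k < m:  # step [2b]: append own signature and relay
                chain = signers + (i,)
                allowed = traitor_targets.get(i, ()) if i in traitors else lieutenants
                outbox += [(j, value, chain) for j in allowed if j not in chain]
        inbox = outbox
    return {i: choice(V[i]) for i in lieutenants if i not in traitors}


if __name__ == "__main__":
    plan = {1: "ATTACK", 2: "ATTACK", 3: "RETREAT"}
    # Commander 0 and lieutenant 3 are traitors; 3 relays RETREAT only to lieutenant 1.
    print(sm(4, 2, "ATTACK", {0, 3}, plan, {3: {1}}))
    # Same attack, but the algorithm is run with too small an m.
    print(sm(4, 1, "ATTACK", {0, 3}, plan, {3: {1}}))
```

In the second run, lieutenant 1 learns the extra value too late to pass it on, so the two loyal lieutenants end up with different sets. This is the reason the number of relay rounds has to match the number of traitors.
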
Clarification


* How will a lieutenant know that it will not get any more messages?
* In round k every lieutenant gets a message with k signatures.
* The lack of a message can be detected.
* By the end of round k every loyal lieutenant is sure that it has seen all the
  messages with k signatures
* We will have at most m rounds (needs to be proven)
* Thus the algorithm will terminate.
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So, a quick clarification: how will a lieutenant know that it will not get any
more messages? In other words, how do we terminate? In round k, every lieutenant gets
messages with k signatures (other than the commander's, of course), and the lack of a
message can be detected. So by the end of round k, every loyal lieutenant is sure that it
has seen all the messages with k signatures. This can be ensured.


What we need to prove is that we will require at most m rounds, where m is the number of
traitors, and that the algorithm will ultimately terminate. Before the question is asked,
let me answer it, since it is the most common question: if the commander sends different
values to different people, should it not take just a single round?


If it has sent, let us say, five different values to different people, it seems that it
should take just a single round, where all of them send all the values that they have
gotten to everybody, and then they compute the choice function on the values. However, it
is not that simple. The reason is that a value, say v_2, may have been sent to a lieutenant
that is actually a traitor.

What the traitor would do is keep quiet and not send the value to everybody. So this value
will not circulate. And since this value will not circulate, we require many more rounds to
ensure that all the values that have been sent are in circulation by the time the algorithm
terminates.
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Proof

* IC2:
  * If the commander is loyal, he will send v:0 to every lieutenant.
  * Since v cannot be forged, all the lieutenants will get v. They will only pass v.
  * Thus V_i will only have a single element, v

* IC1: commander is a traitor
  * We need to prove that for two loyal lieutenants: V_i = V_j
  * Consider an element v ∈ V_i
  * When i added v:
    * Either this is the first message (round 0). Then i sends v to j, so v ∈ V_j
    * If not, then i must have gotten the message: v:0:j_1:...:j_k
      * If j is one of (j_1 ... j_k), it must have already gotten an order with v. This means v ∈ V_j
      * If j is not one of them, then i sends the order to j (step 2b)
  * The only case we need to consider is when (k = m), and v ∉ V_j
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Algorithm

* [0] The commander signs an order and sends it to every lieutenant

* For each lieutenant i
  * If he receives a message of the form v:0 and has not received any order
    * V_i = {v}
    * [1] Sends the message v:0:i to every other lieutenant

  * Lieutenant i receives a message of the form v:0:j_1:...:j_k and v ∉ V_i
    * [2a] Add v to V_i
    * [2b] If (k < m) sends (v:0:j_1:...:j_k:i) to every lieutenant other than (j_1 ... j_k)

  * [2c] If lieutenant i is not sending a message, it sends a message to
    everybody telling them that it will not send a message. This can be
    detected via a timeout as well.

* [3] When lieutenant i will get no more messages, it chooses choice(V_i)


So let us now prove the algorithm. The proof is reasonably simple. We will first prove IC2,
which covers the case where the commander is loyal. What happens then? Only a single
message of the form v:0 will be sent to every lieutenant. This eases our job significantly.
Since v cannot be forged, all the lieutenants will get v, and they will only pass v.

So regardless of the number of rounds, the only value in circulation will be v. The set of
values V_i at each node will contain only a single element, v. So the choice function will
only return v, and hence Byzantine consensus is achieved. This was the easy case.


Now consider the more difficult case, where the commander is a traitor. This is exactly
what this algorithm was designed for. We need to prove that two loyal lieutenants, i and j,
actually get the same set: V_i = V_j. So consider an element v ∈ V_i. What happened when i
added v?

When i added v, either this was the first message, which is round number 0, or it was not.
If this was the first message, then i would have sent v to j, so v would definitely be in
V_j. If not, then i must have gotten a message of the form v:0:j_1:...:j_k. If j is one of
j_1 to j_k, this means that j has put its valid signature on the message. So it must have
gotten an order with v, and v must thus be in j's set as well.
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Now assume that j is not one of them, which means that j is not in j_1 to j_k.
That is also simple. Look at step 2b of our algorithm: if k < m, then i sends this message
to every lieutenant other than the signatories of the message.

In this case, we are assuming that j is not part of the original list of signatories, so i
will send v to j, and j will have it anyway. The only case that we need to consider, where
an element v that is in V_i is not in j's set, is when i gets the message in the last
round, when k = m, and this value is not in V_j. So what do we do now?
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Proof - II

* Since k = m, and the commander is a traitor.
  * At most (m-1) lieutenants are traitors.
  * This means one lieutenant out of j_1 ... j_m is loyal
  * Let it be j_x
  * j_x must have sent v to j.

* This proves that:
  * if v ∈ V_i then v ∈ V_j

* Second, m rounds are sufficient.


This is what we now need to prove. Since k = m and the commander is a traitor, what can we
assume? We are assuming a total of m traitors, and one of them is the commander. So at most
m - 1 lieutenants are traitors. This means that at least one lieutenant out of j_1 to j_k
is loyal. So what is the crucial point? Before it is asked, let me explain.


In the last round, which is the mth round, every single message needs to have m
signatures. So it is not possible for one of the traitor nodes to wake up in the last round
and start sending messages. That is because it is not going to be able to get m signatures,
so the message is going to be rejected. All the loyal lieutenants are going to reject the
message.


What that traitor can do is try to play mischief: instead of sending messages to
everybody, it sends them to a small set and tries to exclude one of the nodes, in this case
the node corresponding to general j, which will not receive the message. We want to claim
that this is not possible. The way we will do it is to consider the set of all the
signatories in the last round, which is j_1 to j_m. At most m - 1 of these are traitors.


Not m, because the commander is a traitor, which means that at least one of them is
actually honest. Call it j_x. When this honest signatory got the value v, it would have
sent the value to all the lieutenants that had not signed the message. And since j is not
in the set of signatories, j_x must have sent a message to j. This is very important.

Let me repeat: since there are m signatories in the last round with at most m - 1 traitors,
one of them must have been honest, and that node must have sent v to j. This proves that if
v is an element of V_i, then v is also an element of V_j. This means that the set V_i is a
subset of V_j. We can use the same argument in the reverse direction to prove that V_j is
also a subset of V_i. So the two sets are equal.


So for any two loyal nodes, their sets are equal. This further proves that m rounds are
sufficient, mainly because at most m - 1 of the signatories can be traitors. One of those
signatories has to be honest, and that will ensure that all the rest of the nodes get the
value. So all the values that the commander sent in the 0th round will become visible and
get stored.

The choice function on the same set, V_i or V_j, will return exactly the same value.
Whatever the value is, we do not really care, but it will be the same value. This pretty
much establishes Byzantine agreement, and it completes the proof for our Byzantine
consensus algorithm, which is a form of a command consensus.
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So we looked at a very expensive solution, but we also looked at a better solution
that works even if we have m traitors, as long as we have more than m nodes; otherwise it
becomes a vacuous problem. With m + 1 nodes it is of course trivial; the problem is
interesting when we have m + 2 nodes or more.

Using signed messages of this type, where signatures can be appended, we can solve the
consensus problem. We will see something very similar in principle when we discuss
blockchains.
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Advanced Distributed Systems 
Smruti R. Sarangi 
Department of Computer Science & Technology 
Indian Institute of Technology, Delhi 
Lecture 15 
Virtual synchrony, 2-phase and 3-phase commit protocols 


In this lecture, we will study virtual synchrony and commit protocols. The main idea here
is to see how to effect group communication, and reliable group communication at that, and
how to use it to solve a very basic problem in distributed systems called the distributed
commit problem.
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Virtual Synchrony 


So, first we discuss virtual synchrony. 
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One-to-one Communication: Unicasting


[Figure: client and server]


* Client-server
* Messages can get lost.
* There is a need to resend messages.


* Two commonly used semantics


* At-least-once semantics: The operation will be carried out by the server at
least once.


* At-most-once semantics: The operation will be carried out at most once.


So let us consider a one-to-one interaction between a client machine and a server machine. 
So it is possible that there can be faults in the middle and some of the messages may be 
dropped. So given the fact that messages can get lost, there would be a need to resend 


messages. So let us assume that the client sends a message. 


So either the unreliable network can drop the message over here, or maybe the server can 
crash or maybe the response can get dropped. So the client does not really have a 
mechanism of distinguishing between these three scenarios. So often when clients do not 
get a response within a pre-specified period, they timeout, and they send their message 


once again. 


This is where the semantics of the server are very important. Let us say that we are
withdrawing money from a bank account. If the message is sent once again, we will have a
double withdrawal, which should not be allowed. The two commonly used semantics that
servers typically provide are at-least-once semantics and at-most-once semantics.

In at-least-once semantics, if you send multiple messages that are essentially multiple
copies of the same message, it is guaranteed that at least one of them will execute, but
more than one may. In our bank account example, we do not want this.
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We want at-most-once, which basically means we are fine if the transaction does not
happen, because in any case the client has realized that something has gone wrong. If the
transaction does happen, it should happen only once: if money is being debited from the
bank account, it should be debited only once. This is at-most-once semantics.


Different models have to provide one of these semantics, at-most-once being the more
common. What clients typically do is assign a unique serial number, which can be randomly
generated, to every message. Even if they want to retransmit the message, they keep the
serial number the same.

The server looks at the serial number. If that message has been processed in the recent
past, then the response is returned directly; the command in the message is not executed
once again.
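The serial-number check can be illustrated with a toy server. This is a minimal sketch, assuming an in-memory table of replies; the class name `AtMostOnceServer` and its `withdraw` method are hypothetical, and a real server would only keep replies for the recent past rather than forever.

```python
import uuid


class AtMostOnceServer:
    """Toy bank server that executes each serial number at most once."""

    def __init__(self, balance):
        self.balance = balance
        self.replies = {}            # serial number -> reply already sent

    def withdraw(self, serial, amount):
        if serial in self.replies:   # retransmission: replay the stored reply
            return self.replies[serial]
        self.balance -= amount       # first time we see this serial: execute
        self.replies[serial] = self.balance
        return self.balance


server = AtMostOnceServer(balance=100)
serial = uuid.uuid4().hex            # the client picks a random serial number once
first = server.withdraw(serial, 10)
retry = server.withdraw(serial, 10)  # client timed out and resent the same message
```

Both calls return the same reply and the account is debited only once, because the retry carries the same serial number as the original request.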
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Notion of Process Groups: Multicasting


* Processes can exhibit all kinds of failures
* Fail-silent: Just fails without any intimation.
* Fail-stop: The failure can be detected.
* Fail-safe: The failure is benign.


* Create a group of processes to service the client’s request
* Replicate the state across processes


* Give the same user input to all the processes, collate the outputs, and decide the
result based on voting


* Failure tolerance
* To tolerate k fail-stop failures, we need k+1 processes.
* If processes produce arbitrary outputs, we need 2k+1 processes (use voting)
* If the process sending the input is malicious, we need 3k+1 processes (Byzantine)


Now let us consider a slightly more sophisticated form of reliability, where we actually
create a process group. Let us first look at the kinds of failures we are talking about.
One is a fail-silent failure, where a process just fails without any intimation. So nobody
else gets to know that the process has failed.
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The other is the fail-stop failure mode. In this case, the failure can be detected,
so something can be done about it. At least we know that a given process has failed. The
third is fail-safe. In this case, the failure is benign, which means that even if a process
has failed, we allow it to remain failed because it will cause no harm.


What we actually do is use a replicated state machine model, which many consensus
protocols, notably Raft, also use. We replicate the servers. Each server has a state
machine, which represents the internal logic of the server. We also have a centralized
dispatcher.

Any client request is sent to the dispatcher. The dispatcher creates multiple copies of it
and sends them to the different servers. All of them apply the command in the message to
their internal state machines. One advantage is that if, let us say, two of these servers
fail, at least the remaining server can keep the entire system running. So this adds a
layer of redundancy to a distributed system, such that it can tolerate failures.


This is done by replicating the state across the server processes and giving the same user
input to all the servers. They compute their results, which are supposed to be the same in
the normal case. When their responses come back, we have a voting engine that chooses the
majority, and that is passed on to the client.


So let us look at some of the failure tolerance guarantees. If there are k fail-stop
failures, we need (k + 1) processes, because one of the processes will keep running. If the
processes are given the correct input but can produce arbitrary outputs, we need (2k + 1)
processes to tolerate k failures. The reason is that we will have a majority of at least
(k + 1) correctly running ones; hence voting would work in this case.


The dispatcher is like the commander in the Byzantine consensus algorithm. If the
dispatcher is not trusted, it can create different versions of the client request and send
them. Then we say that the input is wrong; it is malicious. To tolerate such failures, as
we have discussed in the lecture on Byzantine agreement, we need (3k + 1) processes.
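The three bounds and the voting step can be written down directly. The following is a small sketch, not part of the lecture; the function names `processes_needed` and `vote` and the failure-model labels are hypothetical.

```python
from collections import Counter


def processes_needed(k, failure_model):
    """Processes required to tolerate k failures under the three models above."""
    factor = {"fail-stop": 1, "arbitrary-output": 2, "byzantine-input": 3}[failure_model]
    return factor * k + 1


def vote(outputs):
    """Return the output produced by a strict majority of replicas, else None."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None
```

For example, with k = 1 and arbitrary outputs we need 3 replicas; if they answer 42, 42 and 7, the voting engine returns 42 because two of the three replicas agree.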
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Reliable Multicasting


* Define a multicast channel, c
* Sender group SND(c): Processes that can send messages on channel c
* Receiver group RCV(c): Processes that can receive messages on channel c


* Reliability guarantee


* If process p ∈ RCV(c), message m should be delivered to p, as long as p does
not change its membership throughout the duration of the message transfer.


* Atomicity guarantee
* If message m is delivered to process p, then m is delivered to all the processes
in RCV(c)
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So let us now define the notion of reliable multicasting. As you have seen, this
operation of sending one message to multiple servers is known as multicasting. A one-to-one
message exchange is a unicast, and a one-to-many exchange is a multicast. This is where we
desire some degree of reliability. It is a very central mechanism in the design of our
distributed system.
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So let us define a multicast channel c; if you look at it, this entire thing is a
multicast channel c. We have a sender group, which is the set of processes that can send
messages on the channel. And we have a receiver group, which is the set of processes that
can receive messages on channel c.


The reliability guarantee is as follows: if a process p is an element of RCV(c), then
message m should be delivered to p as long as p does not change its membership throughout
the duration of the message transfer. This basically means that we are creating these
process groups, a sender group and a receiver group. Assume we want to deliver a message m
to process p.

As long as p is an element of the receiver set and continues to remain an element
throughout the duration of the transfer, the message should be delivered to p. There is
also an atomicity guarantee, which says that if the message m is delivered to process p, it
is delivered to all the processes in the receiver set of the channel.


So why do we have the reliability and the atomicity guarantees? The reason is that the
sender and receiver process groups can change during the execution. Given that the process
groups change, the question is whether we deliver a message to, let us say, a process that
was in a group at one point and was not at another. What is the correct behavior?

Reliability says that if a process was part of a group for the entire duration of the
message transfer, it needs to get the message. The atomicity guarantee says that if a
message m is delivered to a process p in the group, then it is delivered to all the
processes in that group. So we have all-or-none logic, which we typically have with most
atomicity correctness constraints: a message is delivered either to all the receivers in
the process group or to none of them. All or none.


(Refer Slide Time: 11:04) 
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Implementation in a LAN 


* The sender sends a message to a set of receivers 


* One of them sends an acknowledgement (ACK) on a shared channel 


* The rest snoop the message


* If any receiver hasn’t gotten the original message, it requests a
retransmission


This is not very hard to do on a simple LAN (local area network), where all the nodes are
connected with each other on a bus. The sender in this case simply sends a message to a set
of receivers. One of them sends an acknowledgement, an ACK, on the shared channel saying
that it has gotten it. The rest are pretty much snooping on the channel.

If one of them has sent an ACK, the rest do not have to send an ACK. Furthermore, if a
receiver has not gotten the original message but sees an acknowledgement for it, then it
sends a separate request to the sender for the original message, which it gets back.

This is a simple way of doing it on a local area network, but the issue is that it requires
some access to the internals of the network, because in general, user processes are not
allowed to snoop the Ethernet channel. So this is a good starting point, but we need a
protocol that works completely at the application layer.
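The ACK-and-snoop scheme can be sketched as a toy model of one multicast on a shared bus. This is only an illustration under simplifying assumptions: the function name `lan_multicast` is hypothetical, the ACK itself is assumed never to be lost, and the first receiver that got the message is arbitrarily chosen as the one that sends the ACK.

```python
def lan_multicast(receivers, lost):
    """Model one multicast on a shared bus.

    receivers: list of receiver names, in some fixed order.
    lost: set of receivers that did not get the original message.
    Returns (acker, retransmission_requests).
    """
    got_it = [r for r in receivers if r not in lost]
    if not got_it:
        return None, []      # nobody can ACK; the sender would time out and resend
    acker = got_it[0]        # one receiver puts an ACK on the shared channel
    # Every receiver snoops the ACK; those that missed the message ask for it.
    requests = [r for r in receivers if r in lost]
    return acker, requests
```

For instance, if r2 misses the message among receivers r1, r2 and r3, then r1 sends the ACK and r2, on seeing an ACK for a message it never got, requests a retransmission from the sender.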
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Virtually Synchronous Multicast 


* Processes can fail at any time 
* Hence, we need to change our definitions 


* Virtually synchronous multicast 
* A message is delivered to all non-faulty members of the group
* All the members agree on the current group membership
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So this is where the notion of virtually synchronous networks, or a virtually
synchronous multicast, comes in. The assumptions are as follows: the processes can of
course fail at any time, with or without intimation. We define a virtually synchronous
multicast as follows: a message is delivered to all non-faulty members of the group. So it
is atomic: it is either delivered or it is not delivered. But if it is delivered, it is
delivered to all the non-faulty members of the group. That is point number 1.


Furthermore, all the members of the group agree on the current group membership, which
means that they agree on the set of processes that are a part of the group. So that is the
idea: a message is delivered to all the non-faulty members, all the members agree on the
current group membership, and delivery is atomic.


We define the notion of a view over here, which is a subset of the receiver set and the
sender set. Processes are added or deleted via a view change. In a stable state, all the
processes agree on the current view, and by the way a view is defined, a message is either
delivered to all the processes in the view or to none.
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Of course, the view is considered to be a set of working, non-faulty processes.
That will clearly be a subset of the receivers and the senders. The good thing about the
view in a virtually synchronous system is that a view is essentially a set of non-faulty
processes for which you have atomic and reliable delivery guarantees. Furthermore, all the
non-faulty processes see all the view changes in the same order, which means that there is
a global consensus on the way that the view changes.


(Refer Slide Time: 15:20) 


Virtually Synchronous Multicast: View Changes
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So, let us say that we have a given view, which means we have a set of processes. We have 
a process group. So view is essentially being used as a synonym for a process group. And 
the assumption here is that the process group is a subset of the global senders and receivers. 


And all the servers in this process group are non-faulty, at a given point in time. 


So let us assume that view V changes to view V*. If a message m is sent to V before the
view change, then either all the processes p ∈ (V ∩ V*) receive m, or none do. To see the
important point, let us define a view V and another view V*, and let us say that before the
view actually changes, I send a message m to view V.


One thing that is clear is that the processes that do not change should get the message.
Now consider this set, which lies at the intersection of V and V*. What we claim is that
all the processes in this set also get the message m. So there is an inherent sequentiality
in the view change, which means that before a view changes, all the messages are delivered
to the old view.


In a sense, if I were to consider a linear timeline, first come all the message deliveries
of m, m' and so on. Then there is a view change. After that, all the messages get delivered
to the new view, with the same guarantees of atomicity and reliability.


The other interesting thing is that a message sent to view V can be delivered only to
processes in V and not to successive views. So let me tell you the overall crux of this
discussion: we are defining a process group of non-faulty processes called a view.

The view is essentially the current set of receivers. If I were to send a message, it is
either delivered to everybody or to nobody. If the view changes, that is considered to be
an atomic event. Just before the event, all the messages get delivered to the previous
view. Just after the event, all the messages get delivered to the new view. So this happens
sequentially.
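The view-change rule can be phrased as a small check on where one message ended up. This is a sketch under the assumption that views and delivery sets are plain Python sets of process names; the function name `view_change_ok` is hypothetical.

```python
def view_change_ok(old_view, new_view, delivered_to):
    """Check one message that was sent to old_view before the view change.

    old_view, new_view, delivered_to: sets of process names.
    """
    survivors = old_view & new_view             # processes in V ∩ V*
    got_it = delivered_to & survivors
    only_old_view = delivered_to <= old_view    # never leaks into later views
    all_or_none = got_it == survivors or not got_it
    return only_old_view and all_or_none
```

For example, with V = {P1, P2, P3, P4} and V* = {P1, P3, P4}, delivering m to P1, P3 and P4 passes the check, delivering it to nobody also passes, and delivering it to P1 alone fails.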
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Example of Virtual Synchrony 


Virtually Synchronous Multicast — View Changes 


* Let us say that view V changes to view V* 


* If a message m is sent to V before the view change, then either all
p ∈ (V ∩ V*) receive m, or none do.


* All non-faulty processes in the same view get to see the same set of 
multicast messages. 


* A message sent to view V can be delivered only to processes in V, and 
not to successive views. 
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So let me give an example of virtual synchrony. Let us say that the view here is
P1, P2, P3, P4. Any message will have to be sent to all of them. Similarly, if I were to
consider this message sent by P2, it will have to be delivered to P1, P3 and P4, which
means the rest of the processes.


Now let us assume that there is a failure with process P2. So the view changes to P1, P3
and P4. In this case, any subsequent message transmission only goes to the processes in the
new view, which are P1, P3 and P4. Here, when P1 sends a message, it sends it to P3 and P4.
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Now assume that P2 comes up and P3 goes down. So the new view becomes P1, P2, P4, 
and any subsequent message transmission is only sent to the members of the new view, 
which are P1, P2, and P4. Finally, it is possible that at one point in time, denoted by this 
dotted line, the view will finally change to P1, P2, P3, P4. 


Here, if I were to draw a line like this, it would be wrong. Why is that? Because nothing
can be delivered to this process; it is faulty. If I were to draw a line like this, that
would also be wrong, because this message was fundamentally meant for the processes of the
older view. So it cannot be delivered to a process in the newer view, which is what is
being said here.

A message sent to view V can only be delivered to processes in V, and not to processes in
successive views, which is why this particular message transfer would be illegal. So we
consider a view to be a closed world of its own, where atomicity and reliability guarantees
are provided. Once the view changes, all old messages are dead. Our time starts from when
the view changes.
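The example can be replayed in a few lines. This is a sketch that assumes views are numbered in the order in which they are installed; the names `VIEWS` and `legal_delivery` are hypothetical.

```python
VIEWS = [
    {"P1", "P2", "P3", "P4"},   # view 0: initial view
    {"P1", "P3", "P4"},         # view 1: P2 has failed
    {"P1", "P2", "P4"},         # view 2: P2 is back, P3 is down
    {"P1", "P2", "P3", "P4"},   # view 3: everybody is up again
]


def legal_delivery(sent_in_view, sender, receiver, delivered_in_view):
    """A delivery is legal only inside the view in which the message was sent."""
    members = VIEWS[sent_in_view]
    return (sender in members
            and receiver in members
            and delivered_in_view == sent_in_view)
```

P2 sending to P1 in view 0 is legal. P1 sending to P2 in view 1 is not, because P2 is faulty and outside the view. A message sent by P1 in view 1 and delivered in view 2 is also illegal, because it crosses a view change.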
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Few more assumptions 


* A sender to a view V should be a member of view(V)
* Many people define virtual synchrony by relaxing this assumption


* If a sender s ∈ V crashes
* First we flush its multicast message (if possible)
* Then remove s from V
* As long as s is a member of V, all the assumptions of virtual synchrony
continue to hold
* If a receiver r ∈ V crashes


* We can either deliver the message later (rest of the processes in the view
have a copy)
* Or remove the receiver from the view.
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So, a few more assumptions. People in the literature have made two kinds of
assumptions. Should a sender be a member of view V? Many people say yes; many say no, it
need not be. The assumption that we are following is that the sender should be a part of
view V.


Now assume that a sender that is part of a view crashes. What happens then? We need to
flush all of its multicast messages that are in flight, if possible, because we wish to
provide atomicity guarantees.

This is a fairly complex process. I have summarized multiple slides' worth of material into
a single line, and it will be elaborated quite a bit in subsequent slides. First we take
care of its messages, which means that either the messages get delivered to all the
processes in the view, or they are flushed from the system.


Then we remove s from V, and we go in for a view change. As long as s ∈ V, all the
assumptions of virtual synchrony continue to hold; after s is removed, we go for a view
change. If a receiver crashes, we can deliver the message later, because the rest of the
processes in the view have a copy.

So if, let us say, it comes back up, we deliver the message later, provided we are willing
to tolerate transient faults. The other option is to remove r from V. The latter is more
common: the receiver is removed and we go for a view change.


One important point is that here we are simply talking of delivery; we are not talking
about consistency requirements like causal or sequential consistency. Virtual synchrony as
such is independent of the order of message delivery. The only thing that virtual synchrony
cares about is these process groups, which we are also calling views.

It basically says that a message is either delivered to everybody in the current view or
to nobody, and the view changes appear to happen to everybody in the same global sequential
order.


(Refer Slide Time: 23:40) 
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Implementation 


* We define view(p) for process p as follows
  * If p ∈ view(q) then q ∈ view(p)


* Messages received by p are queued in queue(p)
* If p fails:
  * First flush the messages sent by p if they are not delivered to any process in
    the view. Ensure all the outstanding messages are delivered.
  * Then remove p from the group (view)
* Sending a message
  * Attach a timestamp with each message (increases by 1 with every send)
  * Assume FIFO channels
  * Highest numbered message from q that is received by p is stored in rcvd_p[q]
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So how would we implement a virtually synchronous system? Well, that is easy to do. And in
fact, it is required in any practical distributed system. There is a requirement, and that too
a reasonably strong requirement, to implement virtual synchrony. So this is how we implement
it. So first we start with defining view(p) for process p as follows.


So if p ∈ view(q), then q ∈ view(p), which means the views are identical. So if a given
process is in q’s view, then q is in the view of that process. So any message that is received
by process p is, for the moment, temporarily kept in a buffer called queue(p). So consider
this to be an array, which points to this buffer.
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So, any message that is received by p is stored in this temporary buffer. Now assume that 
p fails. So if p fails, we first flush the messages sent by p if they are not delivered to any
process in the view. So let us say p has failed, and p has some messages that it intends to
send, and they have still not been delivered to any process. Then these messages are flushed,
because if a single process gets a message, all the processes need to get it.


Then we need to ensure that all the outstanding messages, which means those messages that
some subset of processes in the view have gotten, are ultimately delivered to all. Once that
has happened, we remove p from the view and we go in for a view change. So the important
aspect that we need to consider over here is that if any process p fails, then we need to
look at all the messages that it has sent or is intending to send.


So if a message is still there in its output queue, and it has not been sent, the message is
flushed. Otherwise, we wait till the message is delivered to everybody. And then we remove
p from the group. So for sending a message, what we do is we attach a timestamp with each
message. It increases by one with every send. So we essentially attach a timestamp and the
ID of the sender.


So let us provide a solution to this assuming FIFO channels. So for this, we introduce the
data structure rcvd_p[q]. So the highest numbered message from q that is received by p, let
it be stored in this array. As we have just described, every message has a sequence number,
so the structure of a message is essentially the sequence number along with the sender's ID.


So if q sends a message to p over a FIFO channel, p maintains a data structure, which stores
the sequence number of the latest message received from q, that is, the message received
from q with the highest sequence number. So we will see that this helps us enforce virtual
synchrony.
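To make the bookkeeping concrete, here is a minimal Python sketch of the timestamp and the rcvd array. The class and field names (Message, Process, last_sent) are assumptions made only for this illustration, and the network is replaced by direct function calls, so this is a toy model and not a full implementation.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Message:
    sender: int    # ID of the sending process
    seq: int       # timestamp: increases by 1 with every send
    payload: str


@dataclass
class Process:
    pid: int
    view: list                                    # IDs of the processes in the view
    last_sent: int = 0                            # timestamp of the last message sent
    queue: list = field(default_factory=list)     # queue(p): received messages
    rcvd: dict = field(init=False, default_factory=dict)  # rcvd[q]: highest seq from q

    def __post_init__(self):
        self.rcvd = {q: 0 for q in self.view}

    def send(self, payload):
        """Stamp a new message; the sender immediately receives its own message."""
        self.last_sent += 1
        msg = Message(self.pid, self.last_sent, payload)
        self.receive(msg)
        return msg

    def receive(self, msg):
        """Queue the message and record its sequence number in rcvd."""
        # On a FIFO channel, messages from one sender arrive in sequence order.
        assert msg.seq == self.rcvd[msg.sender] + 1, "FIFO assumption violated"
        self.rcvd[msg.sender] = msg.seq
        self.queue.append(msg)


p1, p2 = Process(1, [1, 2]), Process(2, [1, 2])
for m in (p1.send("a"), p1.send("b")):
    p2.receive(m)
print(p2.rcvd)   # rcvd_2: two messages from process 1, none from process 2
```

Because the channel is FIFO, storing only the highest sequence number per sender is enough: having received message k from q implies having received every earlier message from q.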
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Implementation - II 


* p periodically sends rcvd_p[] to all the processes in its view


* Each process p records rcvd_q (from any other process q) in an array
remote_p[][].
* remote_p[q][] indicates what p knows about message arrival in node q


* Consider a sample remote[][] array
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So, let us now look at the second aspect of the implementation. So p periodically sends a 
received array, the rcvd_p array, to all the processes in its view. So recall that the array
was defined earlier, where each column of the rcvd_p array, let us say the qth column, stores
the sequence number of the latest message that p has gotten from process q.


So, p periodically sends this array to all the processes in its view. Each process p records
rcvd_q, where we are assuming q is a generic process that can send messages to p. So
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this is stored in an array remote_p. So the structure of remote_p is that it has rows and
columns. So if I were to consider the qth row, which in this case is for P2, it stores a
vector of four values, and this is pretty much the rcvd_q array that came from q.


So, each element of this row stores the sequence number of the latest message received
from the process corresponding to that column. So let us give a practical example of a remote
array. So, consider this row over here, which is the row for Process 2. What this is saying is
that Process 2 has received a message with sequence number 2 from Process 1, it has received
a message with sequence number 3 from Process 2, it has received a message with sequence
number 1 from Process 3, and finally, it has received a message with sequence number 4 from
Process 4.


So this is the state. So let this remote array be from the point of view of Process 1, where
Process 1 = p. From p’s point of view, this is what P2 has seen, because this is the last
state of the rcvd_q array that was sent to p. So it has simply stored it in the corresponding
row of the 2D array remote. Similarly, P1 has something about P3, and P1 has something about
P4.


And so, this would basically be the rcvd_3 array that it got from P3, and this would be the
rcvd_4 array that it got from P4. And of course the row that is corresponding to itself is
pretty much its own received array. So, this is rcvd_p. So, the important thing to consider
over here is actually these columns, where the columns are telling us something.


So, consider the first column. So this is what all the processes know about messages from
P1. So clearly, P1 will have the latest information about its messages because it also sends
to itself. So, in the column, the message with sequence number 2 has been received by P1, by
P2, and by P4. However, P3 has not gotten that message. The latest message that it has gotten
has a sequence number of 1.


That is the reason when we compute the minimum, the minimum does not match all the
entries. There is a divergence. So this message, which is message 2 from P1, is considered to
be unstable. That is because it has not reached all the processes. So it is still
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in flight. However, if I consider this message, which is message 1 from P3, it has reached
P1, it has reached P2, it has reached P3, it has reached P4, and the minimum is equal to all
the elements of the column.


So message 1 from P3 is considered stable. So message 2 from P1 is unstable, we can clearly
see why. And message 1 from P3 is stable, we can also see why, because everybody has gotten
it.
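The column-minimum check can be sketched in a few lines of Python. The function names are assumptions made for this illustration. In the sample matrix, only the row for P2 and the columns for P1 and P3 come from the example above; the remaining entries are filler values assumed so that the matrix is complete.

```python
def column_min(remote):
    """Minimum of each column: the highest sequence number from each sender
    that every process in the view is known to have received."""
    return [min(column) for column in zip(*remote)]


def is_stable(remote, sender, seq):
    """With FIFO channels, receiving seq implies receiving everything before it,
    so a message is stable once seq <= the column minimum for its sender."""
    return seq <= column_min(remote)[sender]


# Rows and columns 0..3 stand for P1..P4. Row 1 (P2) and columns 0 and 2 follow
# the lecture's example; every other entry is an assumed filler value.
remote = [
    [2, 3, 1, 4],   # rcvd_1 (P1's own array)
    [2, 3, 1, 4],   # rcvd_2
    [1, 2, 1, 3],   # rcvd_3: P3 has not yet received message 2 from P1
    [2, 3, 1, 4],   # rcvd_4
]

print(column_min(remote))         # the min. vector
print(is_stable(remote, 0, 2))    # message 2 from P1
print(is_stable(remote, 2, 1))    # message 1 from P3
```

Note that this is only what one process knows from the last rcvd arrays it was sent, so the check is conservative: a message may already have reached everybody before the local remote array reflects it.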
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Stability and Flushing of Messages [Details] 


* A message is stable if it has been received by all the processes in the view
* (refer to the min. vector)


* The message can be delivered to the process's next layer
* To remove a process, a failure-detecting process needs to multicast a flush
message to all the processes in the view.
* All the processes stop sending new messages.
* Processes send their rcvd arrays to other processes.
* They also elect a leader (coordinator). Once a process finds that its messages have all
been received, and it has received all the messages, it sends a flush_ok message to
the coordinator. Otherwise, after a timeout it sends its list of unstable messages.
* If the coordinator does not get all the flush_ok messages in a given interval, it
collects all the unstable messages, and multicasts them again.
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Example of Virtual Synchrony 
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So now, let us look at the flush process and the view change process in some more detail. 
So, till now we have been looking at it slightly superficially, but given that we have
introduced these tools, we can look at it in slightly more detail. So, a given message m is
stable if it has been received by all the processes in the current view, refer to the
minimum vector.


So, if the minimum matches all the entries in the column of the remote array, the message is
stable. So, the message can be delivered to the process’s next layer. So of course, the 2D
remote array is from the point of view of a single process only, but even if a single
process knows that all the processes in its view have gotten the message, this
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means it is stable. So it can safely be delivered to that process’s next layer, or what is
called, in the terminology of typical consensus papers, being applied to the FSM.


So it can be applied to the FSM that it maintains, and a change can be made. Now, let us
look at some of the complicated cases over here. So let us say that we have a failure
detecting process, and it detects that a process has failed and needs to be removed. So it
multicasts a flush message to all the processes in the view. And a flush message indicates
the possibility of a view change.


So the moment that a process receives a flush message, the first thing it does is that it stops
sending new messages. So it enters a pause state. The processes send their received arrays
to other processes, such that all the processes can update their remote arrays and figure
out which message is stable and which message is not stable.


So, the important thing to notice over here, and we have also seen this in the
Chandy-Lamport algorithm, is that the moment a process receives a flush message, it simply
stops sending and creating new messages, because now a major change is going to happen. So
new messages will only be sent in the next view. Then, the processes elect a leader or a
coordinator.


So, what happens is once a process finds that its messages have all been received, which
basically means its latest message is stable. So this is important. So once the process finds
that all of its messages have been received by the rest of the processes in the view, which
is tantamount to saying its latest message is stable, and it too has received all the
messages, it sends a flush_ok message to the coordinator.


So how can a process know that? Well, that is possible because at least the sender always
has its own latest message. So just look at the diagonal elements in this figure, and it
will be clear what I am trying to say.


So, this entry is what Process 1 knows about messages it has received from itself. This is
based on the fact that when Process 1 creates a message, it immediately receives it. So this
is the latest state about Process 1, which it keeps. And so similarly, if
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Process 2 is sending a sequence number of 3, it basically means that at the moment at
which Process 2 generated this array rcvd_2, 3 was the sequence number of the last message
that it created.


And the same holds for Process 3, the same holds for Process 4. So after a flush message
is sent, we know that no new messages are being created. So consider any received array that
a process gets after processes stop creating new messages. If, let us say, rcvd_q is being
sent, then its qth entry is going to be the latest message that q has generated.


So viewers need to convince themselves of this fact that the qth entry of rcvd_q will be the
latest, because q will at least know what is the latest message it has created. And if it is
not creating any more, this would be the latest message. And so that is how a process can
know if it has received the latest messages generated by all the processes or not.


So coming back to this point, every process would know via exchanging these received
arrays whether all of its messages have been received, which means its latest message is
stable, and whether it has received all the messages. If both hold true, then it sends a
flush_ok message to the coordinator, which means that from the point of view of the current
view, it is done. It has gotten everything, it has sent everything, and whatever it has sent
is stable.


Otherwise, the individual processes wait for some time. And if their messages are not
becoming stable, then they send a list of unstable messages to the coordinator. So the
coordinator takes upon itself the onerous task of ensuring that all the messages sent in a
view reach everybody. That would be the atomicity requirement.


So if the coordinator does not get all the flush_ok messages, then it will collect the
unstable messages by querying the individual processes, and multicast all of them once again
to all the servers that have not gotten them. And after getting acknowledgements, which
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means after ensuring that every message sent in the view has reached everybody, it is time
to change the view.


So it sends a view change message to all the servers in V ∪ V*, where V* is the new view. So
the old ones remove themselves from the view and the new ones add themselves to the view. So
it sends a view change message and the view is changed. So this ensures the correctness.
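The two checks that a process makes before replying to a flush can be sketched as follows. This is a simplified model under stated assumptions: every process is taken to hold the same remote matrix, the matrix only covers the processes taking part in the flush, and the names flush_reply and coordinator_round are made up for this illustration.

```python
def flush_reply(pid, remote):
    """Reply of process pid once everybody has stopped sending.

    remote[r][q] is the highest sequence number that r has received from q,
    and remote[q][q] is the latest message that q itself created.
    """
    n = len(remote)
    col_min = [min(remote[r][q] for r in range(n)) for q in range(n)]
    my_messages_stable = col_min[pid] == remote[pid][pid]
    have_everything = all(remote[pid][q] == remote[q][q] for q in range(n))
    if my_messages_stable and have_everything:
        return ("flush_ok", [])
    # Messages pid holds that some process has not received yet: (sender, seq).
    unstable = [(q, s) for q in range(n)
                for s in range(col_min[q] + 1, remote[pid][q] + 1)]
    return ("unstable", unstable)


def coordinator_round(replies):
    """Union of the reported unstable messages, to be multicast again."""
    pending = set()
    for kind, unstable in replies:
        if kind != "flush_ok":
            pending.update(unstable)
    return sorted(pending)


# Assumed example: P3 (row 2) lags behind the other three processes.
remote = [[2, 3, 1, 4], [2, 3, 1, 4], [1, 2, 1, 3], [2, 3, 1, 4]]
replies = [flush_reply(p, remote) for p in range(4)]
print(coordinator_round(replies))
```

The diagonal entry remote[q][q] is what lets a process decide whether it has received everything: once sending has stopped, it is the latest message q will ever create in this view.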


So if I were to go back to this figure, this basically ensures that none of these solid
arrows cross this dotted line arrow. For example, if I were to delete this and I would do
this, this would be wrong, this would violate virtual synchrony. And the reason is that this
was sent in a previous view and it is being received in the next view.


So this is clearly not allowed. And our method of essentially flushing all the messages and
ensuring all of them have been received by the receivers before changing the view would
precisely stop this situation from happening. So this was virtual synchrony for us.
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So now assume that I have been able to achieve virtual synchrony, which means I can 
create a stable process group. And I have been able to achieve a reliable multicast, which
means that I can send a message to a group of processes using a virtual synchrony kind of
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approach. So what is it that I can do with it? Well, I can use this as the baseline substrate
to design what are called commit protocols.
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Commit Protocols 
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So, consider a typical problem which all of us have faced, and that too it is a very nasty 
problem. So we have a user, who of course swipes the card on a credit card machine, then a
bank which provides the credit card functionality, a data center of something like
MakeMyTrip or Expedia, which provides a ticket booking facility, essentially an online travel
agent, and then of course the airline.


So, we want to buy an air ticket. So unless there is a complete agreement among these five
entities, what is going to happen is that I might charge the card, the money might be debited
from my credit card account, the data center might do some processing, but this is where
things may fail, and I will never get the ticket, even though the money has been debited.


So, this is a very common problem, it is a very nasty and very annoying problem. So what
most of us do in this case is that we call up the office of the online travel agent, we are
lucky if somebody picks up the phone, and then sometimes after 10 minutes, sometimes after
half an hour, sometimes after three days, we get the money back. In some cases, we get the
ticket, in some cases we do not get the ticket. So that also destabilizes our travel plan,
and it is clearly not an optimal situation.


So, what we want is for all of these entities to come to an agreement. Either my ticket is
booked or it is not booked. So in that sense, I do not mind if my ticket
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is not booked. So I will at least immediately get to know. So I can try with the same travel
agent or a different travel agent. I am talking of an online travel agent, a website, any
travel booking site. I can try 5 minutes later, 10 minutes later, I am okay with that.


What I am not okay with is if my money is debited and I do not have a ticket, or I have a
ticket, but I do not have an e-ticket, I do not have an email, so I do not have any paper
proof or a digital proof of my ticket. So, I do not want that. I am okay to be told, look,
our site is down, we cannot book a ticket right now, but I am not okay with anything else.
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2-Phase Commit 


* The nodes elect a coordinator 
* Phase 1a: Coordinator sends a Vote-request message to all participants. 
* Phase 1b: A participant returns either Vote-commit or Vote-abort. 


* Phase 2a: Coordinator collects all the votes. If all are Vote-commit it sends a 
Global-commit message, otherwise it sends a Global-abort message to all. 


* Phase 2b: Each participant waits for the messages from Phase 2a. It acts 
accordingly. 
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So, the first algorithm in this space is called 2-Phase Commit. We will see that it has some
problems. So the nodes elect a coordinator, they elect a leader using any of the leader
election algorithms that we have seen. The coordinator in this case asks all the nodes to
vote. So the nodes would be each one of these: the user, card machine, bank, data center, and
airline. Of course the user is being represented by the program running on his browser.


So, the entities either say Vote-commit or Vote-abort. Vote-commit means we are okay
with the transaction. So for a user, it would mean pressing the submit button, for a credit
card machine it would mean that the card details have been read correctly. For the bank, it
would mean there is sufficient money in the credit card account, for the data center, it
would mean that they have been able to process the details, and for the airline, it would
mean that they have been able to book the ticket.


Once all of them say yes, that would mean all of them are voting Vote-commit. If one of
them has a problem, let us say the airline does not have tickets, it simply sends a
Vote-abort, and then the entire process gets aborted. So the coordinator collects all the
votes. If all of them are Vote-commit, which means the ticket can be booked, it sends a
Global-commit message to all the nodes and tells them, okay, go ahead and commit. Commit
means issue the ticket.


Otherwise, it sends a Global-abort message, which basically means that you just decline
this transaction, and the user is informed accordingly that the ticket could not be booked.
Each participant waits for the messages from Phase 2a, and acts accordingly.
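The coordinator's side of the two phases can be sketched in Python as follows. This is a failure-free toy model: the function name two_phase_commit is an assumption, and each participant is represented by a function that simply returns its vote.

```python
def two_phase_commit(participants):
    """participants maps a name to a function that returns True for
    Vote-commit and False for Vote-abort."""
    # Phase 1a/1b: send Vote-request to everybody and collect the votes.
    votes = {name: decide() for name, decide in participants.items()}
    # Phase 2a: commit only if every single vote is Vote-commit.
    decision = "Global-commit" if all(votes.values()) else "Global-abort"
    # Phase 2b: every participant acts on the same decision.
    return decision, votes


booking = {
    "user": lambda: True,
    "card machine": lambda: True,
    "bank": lambda: True,
    "data center": lambda: True,
    "airline": lambda: False,   # assumed: no tickets left
}
decision, votes = two_phase_commit(booking)
print(decision)
```

Here the airline's single Vote-abort is enough to abort the whole booking, so the card is never charged without a ticket being issued.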
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2-Phase Commit 
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Vote-Request 
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So, the 2-Phase Commit state diagram would look like this for, so this part is for the 
coordinator. So the coordinator starts in an INIT state. So initially nothing happens. It
just sends a vote request. The participant also starts in its INIT state, and once the
participant gets a vote request, it decides what needs to be done.


If it decides to abort the transaction, it sends back a Vote-abort message, and it moves to
the ABORT state. Otherwise, if it decides it needs to commit, it sends a Vote-commit
message, and it goes to the READY state. So mind you, in the READY state, it is not aware of
what the other participants have sent. So, it waits.


So, the coordinator also waits. Once it gets all the messages, if one of them is Vote-abort,
it goes to the ABORT state, and it sends a Global-abort message to all the participants. So
the participants get a Global-abort message, and they just abort. So this means that any of
the nodes over here has a veto power, and it can simply say no, and the entire process gets
aborted. So, as I said, we are fine with that as a business model.


And if all the participants Vote-commit, so once the coordinator sees that all are saying
Vote-commit, it sends a Global-commit message and it moves to the COMMIT state. Once
the participant gets the Global-commit message, well, it sends an acknowledgement and it
moves to the COMMIT state, which means it finishes or commits its transaction.
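The participant's half of this state diagram can be written down as a small transition function. The sketch below is one possible encoding: the names State and participant_step, and the reply string "ACK", are assumptions made for this illustration.

```python
from enum import Enum


class State(Enum):
    INIT = "INIT"
    READY = "READY"
    ABORT = "ABORT"
    COMMIT = "COMMIT"


def participant_step(state, message, wants_commit=True):
    """Return (next_state, reply) for a 2-Phase Commit participant."""
    if state is State.INIT and message == "Vote-request":
        if wants_commit:
            return State.READY, "Vote-commit"
        return State.ABORT, "Vote-abort"
    if state is State.READY and message == "Global-commit":
        return State.COMMIT, "ACK"
    if state is State.READY and message == "Global-abort":
        return State.ABORT, None
    raise ValueError(f"unexpected {message!r} in state {state.name}")


state, reply = participant_step(State.INIT, "Vote-request")
state, reply = participant_step(state, "Global-commit")
print(state.name, reply)
```

The only way out of READY is a message from the coordinator, which is exactly where the trouble starts once failures are considered.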
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So this is a fairly simple state machine over here, and this works in most cases, but the 
reason that it is not used is basically when we consider failures. So after all, why did we
do this? The reason we did this is basically because any of the nodes in here might have an
issue with the transaction. So it would then send either a COMMIT or an ABORT notification.


And if there is a single ABORT, the entire transaction would get cancelled. So, the reason
we have problems with this is because any of these entities can fail. So failures are
something we have to consider. And most often, we have network failures, we have time-outs,
all kinds of things.
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Analysis of 2-Phase Commit


* Let's say a participant fails, recovers and then restores its old state. 

* INIT state: No problem. It just aborts. 

* READY state: It needs to know the global decision (commit or abort) 
after it recovers. Ask the coordinator or other participants. 


* This is very problematic.


* The coordinator might have failed. Other processes might have committed or
aborted. They might have erased all the history of the transaction.


* We will never know what decision the coordinator took.
* ABORT state: Complete the abort process.
* COMMIT state: Complete the commit process.
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2-Phase Commit 
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So in this case, let us say a participant fails, and there is some amount of non-volatile 
memory. So the participant recovers and restores itself to the old state. So if the
participant recovers and restores itself to the INIT state, there is no problem, it just
aborts, because from this state, nothing really has been sent. So which means that I was
there in the INIT state, and then I recovered in the INIT state, which means I have not sent
any message.


So since I have not sent any message, the coordinator also cannot proceed. So then I can just
abort the entire transaction. That is simple. The main problem is the READY state. So in
the READY state, what happens is I was there in the READY state, which means I had sent my
vote, and then I crashed, and then I woke up to be in the READY state.


So if I had earlier sent the ABORT message, well, that is fine. Then the coordinator
would have aborted, so I can abort. But if I had sent a COMMIT message, a Vote-commit
message, it kind of becomes problematic. So I would not know what happened to the system
after I crashed. And I might wake up like 2 minutes later, which is a long time on the
computer's time scale.


So then I would ask the coordinator, and if the coordinator says we have all aborted or all
committed, I would do the same. But the problem is the coordinator itself might have failed,
or the coordinator might just have finished the transaction and erased its state. So then
what I would do is I would ask the other processes.


Other processes might have committed or aborted. They also might have erased all the
history of the transaction. So I would not know what to do. So I would kind of get blocked.
Because the thing is, I am recovering in the READY state. So kindly take a look at this. So
when I am recovering in the READY state, I have already sent a Vote-commit message.


And then, well, I do not know what is to be done because I crashed and then I woke up. And
then I do not know what the others decided. Of course, from INIT, if I had sent a Vote-abort,
then that is okay, I would have simply gone to the ABORT state, or in any intermediate
sub-state, I would have recognized that everybody is aborting, and I would have aborted.


But in the READY state, I really do not know what is to be done because everything
depends on what the others have voted. And if I am not able to contact the coordinator or
another process, then I really do not know what happened to the system. So I will remain
blocked forever.
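The recovery cases can be summarized in a short Python sketch. The function name recover_2pc and the convention that an unreachable node, or one that has erased its history, answers None are assumptions made only for this illustration.

```python
def recover_2pc(state, coordinator=None, peers=()):
    """What a participant does after recovering in the given state.

    coordinator and each entry of peers is "commit", "abort", or None when
    that node is unreachable or has erased the transaction's history.
    """
    if state == "INIT":
        return "abort"              # nothing was sent, so aborting is safe
    if state in ("ABORT", "COMMIT"):
        return state.lower()        # the decision is already known
    # READY: the outcome depends on votes this participant never saw.
    for answer in (coordinator, *peers):
        if answer in ("commit", "abort"):
            return answer
    return "blocked"


print(recover_2pc("READY", coordinator=None, peers=(None, None)))
```

Only the READY case can end up blocked: in every other state the participant either has not influenced the outcome yet or already knows it.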


So this is called a blocking protocol, and it does not give us a deterministic answer,
which is why the 2-Phase Commit protocol is not actually used. And of course, even if I
fail in the ABORT or COMMIT state, that is not an issue because I already know the state of
the system. So to fix this problem, the problem of crashing and recovering in the READY
state, what I do is that I propose a new protocol, whose diagram I will show first, which
is called 3-Phase Commit.
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So instead of two phases, I have three phases. So in this case, I add an extra PRECOMMIT 
state between READY and COMMIT, which in a sense increases the distance from the WAIT
state to the COMMIT state, which we will see is important.


So in this case, if I were to compare this diagram with the previous one for 2-Phase
Commit, there is an extra state in the middle. We call this the PRECOMMIT state, and we
claim that by adding this PRECOMMIT state, many of our problems have been solved. So let us
first go through the protocol and then we will discuss its advantages.


So, it starts in exactly the same way, the coordinator sends a vote request message, which
is step 1. Each of the processes gets the vote request message, so they either send a
Vote-abort, which immediately takes us to the ABORT state. So this part remains the same. Or,
and this is also at time step 2, the participants send a Vote-commit message and they move
to the READY state.


So what happens is this part is the same, that they vote either Vote-abort or Vote-commit,
and if they Vote-commit, they move to the READY state. So when it is in the READY state, the
participant keeps on waiting for messages. If the coordinator gets a Vote-abort, then of course it sends
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a Global-abort. Or, if it gets all the replies and all of them are a COMMIT, then instead
of directly global committing, it sends a Prepare-commit message.


So this should be time step 3. So it sends a Prepare-commit message and it transitions to
the PRECOMMIT state. So once a participant gets a Prepare-commit message, it sends a
Ready-commit message, which means it is saying that I am ready to commit. That is mainly
because it has already voted that it will commit, so it cannot change its mind. So it will
send a Ready-commit message, and it will go to the PRECOMMIT state.


Once all the Ready-commit messages have arrived from all the participants, and this should
be time step 5, the coordinator knows that all the participants are ready to commit. And so
then it sends a Global-commit message. And once the Global-commit message is received, all
the participants commit and send an acknowledgement as well.


So the important point to note over here is just this additional state in this path, which
is PRECOMMIT. And when the coordinator receives all the Vote-commits, instead of directly
committing, it sends a Prepare-commit message, which also takes all the participants to the
intermediate state called PRECOMMIT. And from there, once they receive a Global-commit
message, they finally commit.


So this is the algorithm that each of the participants has to follow. And so this part of the
algorithm was for the coordinator, this is for a participant, they look more or less similar,
just with this extra state.
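A failure-free run of the three phases can be sketched in Python as follows. The class Participant, its method names, and the function run_3pc are assumptions made for this illustration; real messages are replaced by direct method calls, and timeouts are not modeled.

```python
class Participant:
    def __init__(self, name, will_commit=True):
        self.name = name
        self.will_commit = will_commit
        self.state = "INIT"

    def on_vote_request(self):
        self.state = "READY" if self.will_commit else "ABORT"
        return "Vote-commit" if self.will_commit else "Vote-abort"

    def on_prepare_commit(self):
        # The participant already voted to commit, so it cannot change its mind.
        self.state = "PRECOMMIT"
        return "Ready-commit"

    def on_global_commit(self):
        self.state = "COMMIT"
        return "ACK"

    def on_global_abort(self):
        self.state = "ABORT"


def run_3pc(participants):
    """Coordinator's view of one run; returns the coordinator's final state."""
    # Phase 1: Vote-request, then collect the votes.
    votes = [p.on_vote_request() for p in participants]
    if any(v == "Vote-abort" for v in votes):
        for p in participants:
            p.on_global_abort()
        return "ABORT"
    # Phase 2: Prepare-commit instead of committing directly.
    replies = [p.on_prepare_commit() for p in participants]
    assert all(r == "Ready-commit" for r in replies)
    # Phase 3: only now is Global-commit sent.
    for p in participants:
        p.on_global_commit()
    return "COMMIT"


group = [Participant("bank"), Participant("airline")]
print(run_3pc(group), [p.state for p in group])
```

The point of the sketch is the ordering: no participant can reach COMMIT before every participant has been asked to move to PRECOMMIT.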
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3-Phase Commit 


* The nodes elect a coordinator 
* Phase 1a: Coordinator sends a Vote-request message to all participants. 
* Phase 1b: A participant returns either Vote-commit or Vote-abort 


* Phase 2a: Coordinator collects all the votes. If all are Vote-commit it sends a 
Prepare-commit message, otherwise it sends a Global-abort message to all. 


* Phase 2b: Each participant waits for the messages from Phase 2a. If it gets a 
Prepare-commit message, it proceeds to send a Ready-commit message, else it 
aborts. 


* Phase 3a: After the coordinator gets all the Ready-commit messages it sends a 
Global-commit message to all the participants. 


* Phase 3b: The participants wait for the Global-commit message.
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This is summarized in this slide so I am not going over it once again. So, this is something 
that I just described. The operative part of it is that we actually send a Prepare-commit
message that takes us to the PRECOMMIT state, and then after the PRECOMMIT state, we move
to the final COMMIT state. So this extra state over here is what we claim does all the
magic. So how does it do that? Well, let us see.
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Analysis 


* Consider the READY and PRECOMMIT states. 
* If a participant fails in the


* READY state: It can ask the coordinator. If the coordinator has failed, then we
can elect a new coordinator. If any process has gotten a PRECOMMIT
message, then they proceed towards a global commit. Else all processes
abort.


* PRECOMMIT state: Elect a new coordinator that will send a Global-commit
message.


* Coordinator fails in the 


* WAIT state: Participants time out in the READY state 
* PRECOMMIT state: Participants time out in the PRECOMMIT state 
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3-Phase Commit 
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So consider, let us do a similar failure analysis where the participant fails and then it 
recovers to the same state. So, it is assumed that the participant maintains some degree of 
non-volatile memory, which allows it to recover to the same state. So consider a
participant's failure in the READY state. If it fails in this state, it knows what it has
voted for: it has clearly voted for COMMIT.


So, once it comes back up, it asks the coordinator what really happened: did you ABORT or
did you PRECOMMIT? If the coordinator has failed, then we can elect a new
coordinator, and the new coordinator can check whether any process has gotten a
Prepare-commit message. If so, it will proceed towards a Global-commit via this route.
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But, if let us say after querying all the processes, nobody has gotten a PRECOMMIT, which 
basically means that no process has moved beyond the READY state, then what do we do? The
most important point to keep in mind is that in the previous case, 2-Phase Commit, if I woke up in
the READY state, the rest of the processes could have already either aborted or committed. So
essentially, they would be done.


In this case, only the processes that have aborted would be done. The rest
of the processes would still be there in the system; they would not have completed,
because after READY a Prepare-commit message has to be sent, and it
has not been sent. So those processes would still be there. This is what makes the entire
difference.


Since those processes are still there, they can be queried about their current state. And if 
all of them are in their READY states, none of them is in their PRECOMMIT state, then it 
can safely be assumed that the coordinator crashed before sending the Prepare-commit 
message. And then all the processes can move from the READY state to the ABORT state. 


So they can just be safely aborted. 
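
The decision rule that a newly elected coordinator applies can be written down in a few lines. This is a sketch under the assumption that the surviving participants simply report their current state as a string; the function name recovery_decision is made up for illustration.

```python
def recovery_decision(states):
    """Decide the outcome from the states reported by the reachable participants.

    `states` is a list of strings such as "READY", "PRECOMMIT", "COMMIT" or "ABORT".
    """
    # If anybody has seen a Prepare-commit, all participants must have voted to commit.
    if "PRECOMMIT" in states or "COMMIT" in states:
        return "Global-commit"
    # Nobody got past READY: the coordinator crashed before sending Prepare-commit,
    # or some participant has already aborted. Aborting is safe in both cases.
    return "Global-abort"
```

For example, recovery_decision(["READY", "READY"]) returns "Global-abort", while a single participant in PRECOMMIT is enough to return "Global-commit".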


So this is the main advantage of this protocol: even if the coordinator fails, even if a
process fails, we will still not have a situation where we need to block. We can always make
a decision, and the decision in this case will be correct. Let me go over this once more. This is
the primary advantage of the 3-Phase protocol as compared to the 2-Phase protocol. In the
2-Phase protocol, if I woke up in the READY state, there was a possibility that the coordinator
had failed and the rest of the processes had just completed their execution. They would have
either aborted or committed, so the protocol would be over and I might not be able to find out
the final outcome. In this case, only the processes that have aborted would be out of the
system, if there are any. All the other processes, if we are moving towards COMMIT, would
still be there in the system.


So they can always be asked for their current state. And if all of them are READY, we can
safely assume that the coordinator crashed before sending a Prepare-commit message. So
then we can simply abort all the processes and take them to the ABORT state. And that
would be correct. So, what again is the key, most important insight?


The most important insight is that a PRECOMMIT state over here stops a process from 
actually leaving the system. So the processes will still be there in the system. And because 
they are there in the system, we can query their state and figure out what the coordinator 


must have done. 


If a process fails and then wakes up in the PRECOMMIT state,
well, no problem. This means that it must have received a Prepare-commit message in the
READY state and responded with a Ready-commit. So in this case, all that needs to be done is
to elect a new coordinator that will send a Global-commit message and commit all
the processes.


So, they do not move automatically. It is still done via coordinator to ensure that everybody 
commits. And the coordinator sends the Global-commit message to all, and ensures that all 
the processes commit. And in this case, the reason we can safely commit is because all the 
processes have already agreed that they want to commit. That is why they have come via 


this route. 


So all the processes have agreed that they have a desire to commit, and the coordinator has also
verified that all of them have a desire to commit. That is the reason the process entered
the PRECOMMIT state in the first place. So we just need to commit all
the processes, which can easily be done.


Now, suppose the coordinator fails in the WAIT state.
This means that it has not gotten all the Vote-commit or Vote-abort
messages. So the processes will time out in the READY state and ultimately they will elect
a new coordinator, which will take a decision accordingly.


If the coordinator fails in the PRECOMMIT state, it means that
it has sent all the Prepare-commit messages,
so all the participants will also be in the PRECOMMIT state. They will time out and
elect a new coordinator, which will commit all the processes. So as we can see, the
3-Phase Commit algorithm has inherent advantages.


The biggest problem that it solves is the blocking problem of 2-Phase Commit.
2-Phase Commit could block; this particular problem
is not there in the 3-Phase Commit protocol. It is
always possible to make some decision, which is correct. And the entire advantage
accrues from one additional state that we added between WAIT and COMMIT.
So this completes the lecture. The original slides are from the book by Andrew Tanenbaum and
Maarten van Steen, and they have been very gracious to allow me to present the slides in my
class. You can always find the original slides on this link. Virtual Synchrony and the
2-Phase and 3-Phase Commit protocols are otherwise also very popular, so viewers
will find a lot of resources on the web that describe these protocols in detail. Cornell
University also has a toolkit that implements Virtual Synchrony; that toolkit can
be used for actually running a system with Virtual Synchrony.
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Advanced Distributed Systems 
Professor Smruti R. Sarangi 
Department of Computer Science and Engineering 
Indian Institute of Technology Delhi 
Lecture — 16 

Bitcoin and Blockchain Technology 
Welcome to the ultra-short lecture on BitTorrent. BitTorrent is one of the most popular
peer-to-peer, network based file sharing programs as of 2021. It is based
on the familiar technology of DHTs; so, this is an ultra-short lecture.


But, before you go forward, I would request all of you to take a look at the videos for Pastry and
Chord that are a part of this lecture series, because without understanding Pastry and Chord, this
lecture will not be comprehensible. So first, take a look at those. Here we will
only outline a few of the basic points, the salient features of BitTorrent. The rest will be fairly
clear to somebody who understands distributed hash tables. So, the overview.


(Refer Slide Time: 01:08) 
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@ Made to distribute large files. 
@ First create a Torrent descriptor file
  @ Details of the file.
  @ A cryptographic hash of the file's contents
  @ Stored and distributed to search engines.
@ The user joins a swarm of hosts.
@ Each host is a simultaneous downloader and uploader.
@ IDEA: Break a large file into multiple segments (256 KB)
  @ Distribute the pieces to peers.
  @ The peers can subsequently re-distribute the pieces.
@ A BitTorrent client can simultaneously download the different
pieces from different hosts.

So, as compared to Napster and Gnutella, which were typically for smaller files like mp3 music
files, BitTorrent was made to serve large files, such as large video files. The main aim
was to distribute large video files. Because of that, it is necessary to re-architect our
system. The user who is sharing the file first creates what is called a torrent descriptor
file. The torrent descriptor file has the details of the file and a cryptographic hash of the file's
contents.
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So, the cryptographic hash is required for integrity; because since we are talking of a large file, 
and we will discuss how it is actually served, it is possible that some bytes may get corrupted.
So, because of that a cryptographic hash is required; BitTorrent typically uses the SHA-1 hash for
this purpose.


And then the descriptor is stored and distributed via search engines or via a peer to peer system.
The user joins a swarm of hosts; each host can simultaneously be a downloader and an uploader. We
have been seeing the same format in other p2p systems as well. We have seen the same in Napster
and in Gnutella: the user shares a directory, where you have songs and videos and so
on.


So, since we are talking about large files, such as videos, we break a large file into multiple small 
segments. So, each segment is 256 KB and these are distributed to peers; so, these pieces of files 
are distributed to peers. So, this allows the client, the BitTorrent client to actually download all of 
these segments in parallel. So, this increases the bandwidth and also reduces the time needed to 


get a file; and furthermore, it increases the robustness of the system. 


So, the peers can themselves re-distribute the pieces. So this will further add to the robustness; and 
pretty much for every file, we will number the segments 1, 2, 3, 4 and so on; and we know how 


many segments there are. 


So, these segments will then come from different parts of the network. So, the BitTorrent client, 
which is a piece of software that every user needs to install, can simultaneously download the 


different pieces from different hosts. 
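
The idea of cutting a file into 256 KB segments and checking each one against a cryptographic hash can be sketched as follows. This is a toy illustration, not the real torrent file format: the dictionary layout and the function names are assumptions, and SHA-1 from Python's hashlib is used here as the hash, computed per piece so that a single corrupted piece can be detected and fetched again.

```python
import hashlib

PIECE_SIZE = 256 * 1024  # 256 KB, the segment size quoted in the lecture


def split_into_pieces(data, piece_size=PIECE_SIZE):
    """Cut the file contents into numbered, fixed-size pieces (the last may be shorter)."""
    return [data[i:i + piece_size] for i in range(0, len(data), piece_size)]


def make_descriptor(name, data, piece_size=PIECE_SIZE):
    """Build a toy descriptor: file details plus one hash per piece."""
    pieces = split_into_pieces(data, piece_size)
    return {
        "name": name,
        "length": len(data),
        "piece_size": piece_size,
        "piece_hashes": [hashlib.sha1(p).hexdigest() for p in pieces],
    }


def verify_piece(descriptor, index, piece):
    """Check a downloaded piece against the hash stored in the descriptor."""
    return hashlib.sha1(piece).hexdigest() == descriptor["piece_hashes"][index]
```

A client that has received piece number i from some peer would call verify_piece(descriptor, i, piece) and only keep the piece if the check succeeds.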
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@ Each file has a dedicated 
  @ Torrent file -> metadata, hash
  @ A tracker -> a server, which co-ordinates the process of
    downloading the file
@ Approach: Connect to the tracker, which has a list of peers
that contain the different pieces. Connect to the peers to
get the different pieces.

Alternative Approach

Do not use a tracker. Use a DHT instead. This will help you locate
all the peers that contain a given piece. Refer to the Mainline
DHT (uses Kademlia).


So, the key elements in BitTorrent are like this. One is a torrent file, which contains the metadata
and the hash. Metadata means a description of the file that will be used for searching. And then we
have a specialized entity called a tracker, which is a server; this used to be pretty popular in the
early days of BitTorrent. The tracker coordinated the entire process of downloading a file.
This means that the approach would be for the client to connect to
the tracker server. This would have a list of peers that contain the different segments; and then the
client would connect to the peers to get the different pieces.
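
The tracker-based approach described above can be sketched as a small in-memory directory. The Tracker class and its methods are hypothetical names used only for this illustration; the sketch follows the lecture's description of a server that knows which peers hold which pieces, and is not the actual tracker protocol.

```python
from collections import defaultdict


class Tracker:
    """Toy tracker: remembers which peers hold which pieces of which torrent."""

    def __init__(self):
        # (torrent_id, piece_index) -> set of peer addresses
        self._peers = defaultdict(set)

    def announce(self, torrent_id, peer, piece_indices):
        """A peer tells the tracker which pieces it can serve."""
        for idx in piece_indices:
            self._peers[(torrent_id, idx)].add(peer)

    def peers_for(self, torrent_id, piece_index):
        """A client asks where a given piece can be downloaded from."""
        return sorted(self._peers.get((torrent_id, piece_index), set()))
```

A client would first call peers_for for each piece it is missing and then contact those peers directly; the tracker itself never serves file data.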


But, now the tracker has largely gone away, mainly because of legal issues. You do not want to have
one server which has a list of the entire network. So, the tracker has been replaced by a
DHT. The DHT will help you locate all the peers that contain a given piece, using the same
DHT mechanism that we have studied in Pastry and Chord. The DHT that is used is called the
Mainline DHT, which basically uses the Kademlia protocol. Recall that in the last few slides
of the Chord lecture we did discuss the Kademlia protocol to a certain extent.
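
As a reminder of the one idea that distinguishes Kademlia, node IDs and keys are compared using the XOR of their bit patterns, and a lookup moves towards the nodes whose IDs are closest to the key under that metric. The sketch below only shows the metric on small integers; the function names are made up, and real Mainline DHT IDs are 160-bit values with routing tables that are not modelled here.

```python
def xor_distance(a, b):
    """Kademlia's notion of distance between two IDs."""
    return a ^ b


def closest_nodes(target, node_ids, k):
    """Return the k known node IDs that are closest to the target key."""
    return sorted(node_ids, key=lambda n: xor_distance(n, target))[:k]
```

For the key 0b1010 and the known nodes 0b1000, 0b0010 and 0b1111, the distances are 2, 8 and 5, so the two closest nodes are 0b1000 and 0b1111.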
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Overview 


Downloading and Sharing Files 


@ Users need to use regular search mechanisms to find Torrents
of interest.

@ Similarly, if a server has a new file it hosts it, and distributes
the Torrent file. It is known as the seeder.

@ Once a client gets the Torrent file, it connects to the tracker,
and gets the list of peers.

@ Downloads the pieces in a random order.

@ Different strategies:
  @ Prioritise traffic for those nodes that have sent a lot of data
    on the network.
  @ A sender will preferentially send data to the nodes that have
    sent it data in the past (tit for tat).
  @ Keep some bandwidth for yourself, and some for others.
So, downloading and sharing files. Users need to use regular search mechanisms to find torrents
of interest. Similarly, if a server has a new file, it will host it and distribute the torrent file;
it is known as the seeder. Once the client finds a torrent file, it will connect to the tracker, or
it will use the Mainline DHT, and will download the pieces in a random order; so, there is no fixed
order. You can download piece three, piece one, piece two, piece four, in any order. There can be
different strategies. We can prioritize traffic for those nodes that have sent a lot of data on the
network.


So, let us say that I have been a very active uploader, in the sense that I have been very actively
supplying my files; then I should get some priority while downloading. There are also tit for tat
relationships, in the sense that if I gave you something, then you will also give me something back
with a high priority. Furthermore, I can reserve some bandwidth for myself and leave some bandwidth
for others. The main problem that actually happened with BitTorrent, and here again we go back to
college students, is that they were sharing a directory; and the directory
used to have these files and their associated torrent files.


So, all day others were downloading, unbeknownst to the sharer; that was eating up a large part
of their bandwidth. And when they wanted to download, they did not have enough bandwidth. So,
modern clients are configurable; some bandwidth can be reserved for oneself
and some for others.
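
The two strategies just described, tit for tat and reserving bandwidth, can be sketched as follows. This is only an illustration: the function names, the idea of a fixed number of upload slots, and the reserved fraction are assumptions, and real clients use more elaborate choking algorithms.

```python
def choose_upload_targets(received_from, slots):
    """Tit for tat: upload preferentially to the peers that sent us the most data.

    `received_from` maps a peer name to the number of bytes received from that peer.
    Ties are broken by peer name so that the result is deterministic.
    """
    ranked = sorted(received_from, key=lambda p: (-received_from[p], p))
    return ranked[:slots]


def split_bandwidth(total_kbps, reserved_fraction):
    """Keep a fraction of the link for our own downloads and give the rest to others."""
    own = total_kbps * reserved_fraction
    return own, total_kbps - own
```

With received_from = {"a": 10, "b": 50, "c": 30} and two slots, peers b and c are served first; split_bandwidth(1000, 0.25) keeps a quarter of the link for the local user.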
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_ Security and Privacy 


Security and Privacy 
@ No anonymity or security. 
@ The legal onus is more on the site that indexes the Torrents. 


@ Nevertheless, everybody involved in the hosting and propagating
of copyrighted or illegal material is culpable.
  @ Depends on the specific country.
Security and privacy: as such, BitTorrent does not provide anonymity or security. Furthermore,
the legal onus is more on the site that indexes the torrents, like the tracker sites. Even so,
everybody involved in the hosting and propagation of copyrighted or illegal material is, in a
sense, legally culpable. Of course, to what extent it is enforced depends on the laws of the
specific country. There are two broad approaches: either we use a tracker server that provides
a directory, or we use a DHT. The problem with a DHT is that it will require multiple hops;
but again the legal liability is much lower.


Furthermore, a DHT is more robust, and locating peers is still easy. Given the fact that it
will take proportionally much longer to download the segments themselves, locating a node that
has the torrent file will not take that much time. Plus, these are not strictly real time tasks; so
we do not really need to worry about the latency to that extent. BitTorrent as of today is banned
in a lot of places, particularly university campuses, regardless of whether you are using a tracker
or a DHT; so that needs to be understood.
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Overview 


Searching for Torrents

@ Mainline DHT is the largest DHT in the world with somewhere
between 10 million to 25 million computers.

@ All the current versions of the BitTorrent clients are compatible
with Mainline DHT.
@ Alternative approach:
  @ Use a gossip based protocol to have BitTorrent directories
    among the peer nodes (Tribler).
  @ Use anti-entropy to regularly exchange list of Torrents.
  @ Since there are too many Torrents, the software gradually
    learns the user's interest and filters the Torrents.
In spite of that, BitTorrent is extremely popular. The Mainline DHT, which BitTorrent uses, is
the largest DHT in the world. It has somewhere between 10 million and 25 million
connected computers. So, BitTorrent is clearly the largest file sharing system in the world at the
moment. All the current versions of the BitTorrent clients are compatible with the Mainline DHT;
but they can connect to trackers as well.


Furthermore, BitTorrent has expanded, and it uses other kinds
of protocols as well. For example, a gossip based protocol is used to implement and synchronize
BitTorrent directories among the peer nodes; this approach is used in
Tribler.


So, this is again a gossip based scheme where each node just maintains a directory of file names
and servers, and nodes periodically exchange and update these directories; go back to the lecture
on epidemic and gossip based algorithms. We use anti-entropy to regularly exchange the list of
torrents. Furthermore, since there are lots and lots of torrents, the BitTorrent software also
gradually learns about the user's preferences and filters the torrents; it essentially stores those
torrents that are more aligned to the user's viewing preferences. So this, in a nutshell, was
BitTorrent. We did not discuss much about the Kademlia protocol or the Mainline DHT.


But my feeling was that whatever we discussed towards the end of the Chord lecture is enough to
give an introduction to Kademlia. And the protocol, of course, can be read up on the web. But the
main idea with BitTorrent should be clear: it runs what is clearly the largest DHT in the
world, and it uses other methods as well, which include gossip based
algorithms and trackers.
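
The anti-entropy exchange of torrent lists mentioned above can be sketched in a few lines. This is a minimal sketch that models each peer's directory as a Python set of torrent names; the function names and the keyword-based interest filter are assumptions made for illustration and are not taken from any particular client.

```python
def anti_entropy_round(dir_a, dir_b):
    """Push-pull exchange: afterwards both peers hold the union of the two lists."""
    merged = dir_a | dir_b
    dir_a |= merged  # update both directories in place
    dir_b |= merged


def filter_by_interest(directory, keywords):
    """Keep only the torrents whose name matches one of the user's keywords."""
    return {t for t in directory if any(k in t.lower() for k in keywords)}
```

If one peer knows the torrents {"x", "y"} and another knows {"y", "z"}, one round leaves both with {"x", "y", "z"}; repeating such rounds with random partners spreads every entry through the network.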
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So, the BitTorrent Wikipedia article can give a quick introduction. If you want to know more about 
BitTorrent, you can always read this paper by Izal, Mikel, et al. And it talks about five months in 
the torrent's lifetime; so it will tell you everything about it. So, this lecture pretty much
finishes our discussion on DHTs. We have discussed quite a few: we have discussed Pastry and
Chord, we have covered Tapestry and Kademlia in one slide each, and now we have
discussed a system built on a DHT, the Mainline DHT: the BitTorrent system.


So, subsequently we will move to the second half of the course. The first part was essentially
DHTs, epidemic and gossip based algorithms, and so on. The second half of the course will
basically look at distributed algorithms. That is important because DHTs are
only one kind of distributed algorithm; there are many more types and all of them are
required. Finally, we will use the results of parts one and two to create actual systems. We
did see one actual system, BitTorrent; but we will create bigger systems that
use the results taught in parts one and two of this course.
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Advanced Distributed Systems 
Professor Smruti R. Sarangi 
Department of Computer Science and Engineering 
Indian Institute of Technology Delhi 
Lecture — 17 
Coda File System 


We will discuss the Coda distributed file system in this lecture. 
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So, a distributed file system typically has a lot of client caches; so, we will discuss that in some 
detail. Then, we will discuss the semantics of the file system, finally replication. So, first we will 
discuss the broad overview of the design and then the details of the design, and give a little bit of 
an introduction to the evaluation. So of course, the detailed evaluation will be there in the paper. 
So, in this presentation we will just be covering the main points. So, a distributed file system is 
something like this. So, it is very much similar to what many of us would have seen in the 


universities that there would be a centralized file server; and there would be a set of clients. 


So, a client can be a laptop or a desktop, it does not matter; all of them would be connecting to
this file server. Nothing would be stored permanently on the laptops or on the desktops. These
machines are also known as client machines, and the client can be really thin. These are also
called thin clients, which means that the machine is just about a monitor with a very bare bones
processor. The thin client can also connect with the file server. The entire set of files will be
resident on the server; just a small part, whatever the user wants to work on, will be brought in.
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And after the user is done, they get transferred back to the file server. So, the advantage here is 
that I can do something on my laptop and then disconnect it. I can then go to, let us say, some
other machine, say a desktop machine, and connect that to the file server. I will
immediately get back the same state of files that I had left when I was working on my laptop. So,
it is like a large virtual hard drive that we want to implement: the same set of files is seen
across all of these machines.


So of course, all of these machines in this case are the clients; and we need not have a single server, 
so we actually can have a set of servers. And so they will be storing replicas of the files; so that if 


one of the servers fails, the rest of the servers can take over. 
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@ Coda is a large scale distributed file system 
@ Provides a high level of resiliency:
  @ Tolerates server failures by having replicas
  @ Allows for disconnected operation. A client can temporarily
    act as a server
@ Efficient and easy to use.
@ Location transparent.
@ It extends the Andrew File System (AFS)


So, Coda is one such large scale distributed file system. It provides a very high level of
resiliency, in the sense that it is not a single server; it is rather a set of servers. It tolerates
server failures by having a lot of replicas. The replicas ensure that if one server is down, the
rest of the servers can serve the client's requests. Furthermore, it allows for
disconnected operation, which essentially means the following. Say I have connected a laptop to
the server over a wireless network connection, and let us say I drop the wireless
connection.


So, then I can still continue to work on my laptop with whatever files I have cached locally. And
then once I reconnect, whatever changes I made are
synced to the server; so it is rather efficient and easy to use. And of course, all file server
systems have to provide something called location transparency.


So, location transparency basically means that it does not matter where you physically are. Say
this is the globe; I can start from South America. And then, sorry, I am bad at drawing; and then
from there, I can go to Japan. And then I can plug in my laptop,
and I will get back the same set of files; so location does not matter.


So, there are two kinds of very basic file systems, NFS and AFS. Most university campuses have
NFS, where there is one central server, and all the home directories of students are mounted from
NFS. The other is the Andrew File System, which is also reasonably popular. Coda extends the
Andrew File System, AFS.
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Historical Overview 


@ Coda arose out of AFS. 
@ It needed to provide more fault tolerance. 
@ Aim: Constant Data Availability
  @ Provide data availability in spite of failures in the system
@ Was meant to integrate portable computers in the file system
network (read laptops).
@ Need for compatibility with Unix file semantics. 
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So, it arose out of AFS primarily because the AFS versions at that time did not really provide the 
amount of fault tolerance and data availability that the designers of Coda wanted. What we know
from the CAP theorem is that if we want reasonably high data availability, and we also want to be
reasonably tolerant to partitions, then our consistency has to come down. These are some
of the difficult decisions that the designers had to make when they were designing Coda;
this was a tough decision for them.


And we will see what their decisions were and what tradeoffs they had
to make. One more need was that when this paper was being published, laptops
were increasingly becoming very popular and prevalent. What was really happening
in the world of laptops is that people were connecting to the file server for some time, getting a
dump of the files, a view of the files; and then they were disconnecting from the network, because
network connectivity was not ubiquitous. And then they were working on their personal laptops.


So, managing portable computers was definitely one of the issues. And furthermore, there was a 
need for some degree of compatibility with existing Unix like file semantics. So, this was one more 


of the key design decisions that was made. 
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So, the first thing for us is to understand what is it that AFS provided and how did Coda extend 
that? So in AFS, what happens is if a client opens a file, the entire file is fetched from the server. 
This means that if I have a client over here, and I fetch a file from the server, what comes
to me is the entire contents of the file; the entire file comes and resides in the client's cache.
All the read and write operations are then directed to the local file system, because the file in
this case is cached locally. When the file is closed, the contents of the file are
sent back to the server.


So, what happens? Say this is the client and this is the server; the client sends an open
call to the server. The open call returns the entire file. The client
also establishes a callback with the server, which means that if there is a simultaneous
conflicting access from another client, then the server will let this client know that the copy of
the file that it has is actually invalid, because there is a concurrent modification.


This process is known as breaking the callback. The moment the client C receives such an
invalidate message, it will essentially discard its local copy of the file, depending on what it
is currently doing with the file.

So, let us first assume that the client has not opened the file; then that is okay, it will just
mark the copy as invalid. But if the client has
opened the file and is working on it, then it will just keep on working as it is. And once it is
done, it will send a close request.


So in this case, the semantics is that the last writer wins. This means that if, let us say, two
clients C and C' have opened the file at the same time and they are working on it, whoever commits
last wins, where committing means closing the file. The write of that client stays and the other
write gets overwritten; so that is last writer wins. But assume that the client C has
finished its access; then the local cached copy that exists
is discarded. But, if concurrent requests are alive at the same time, then of course the last
writer wins.
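
The open, callback and close behaviour described above can be sketched with two small classes. This is only a sketch under simplifying assumptions: the names AFSServer and AFSClient are made up, files are plain strings, everything runs in one process, and a broken callback merely marks the cached copy as invalid.

```python
class AFSServer:
    def __init__(self):
        self.files = {}      # file name -> contents
        self.callbacks = {}  # file name -> set of clients holding a cached copy

    def open(self, client, name):
        # Ship the whole file and register a callback for this client.
        self.callbacks.setdefault(name, set()).add(client)
        return self.files.get(name, "")

    def close(self, client, name, contents):
        # Store the writer's copy: whoever closes last overwrites earlier writes.
        self.files[name] = contents
        for other in self.callbacks.get(name, set()) - {client}:
            other.break_callback(name)
        self.callbacks[name] = {client}


class AFSClient:
    def __init__(self, server):
        self.server = server
        self.cache = {}       # file name -> locally cached contents
        self.invalid = set()  # files whose callback has been broken

    def open(self, name):
        self.cache[name] = self.server.open(self, name)
        self.invalid.discard(name)

    def write(self, name, contents):
        self.cache[name] = contents  # all writes go to the local copy

    def close(self, name):
        self.server.close(self, name, self.cache[name])

    def break_callback(self, name):
        self.invalid.add(name)
```

If two clients open the same file, write different contents and close one after the other, the server ends up with the contents of the client that closed last, and each client is told that its cached copy is no longer valid.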
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@ Observation: Caching is key to the efficient performance of 
AFS. Better is the cache, better is the performance 


@ Clients cache entire files in their disks. 
@ Uses the AFS caching mechanism as a baseline 
@ Check the cache on a file open() call 
@ If the file is not there, fetch it from the server 
@ If the file has been modified, then write it back to the server 
after the close() call 
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So, now let us look at the caching semantics. So of course in AFS, every client has a local cache; 
and in the local cache, it stores all the files that it has ever cached. A file is removed from the
cache only if the client receives a callback broken message, where it needs to invalidate the
file. And why would a callback be broken? The callback would be broken at the server if another
client writes to the file. Then the server, which in this case maintains state, lets client C know
that a given file is not valid anymore. So, how does AFS manage the caching? Well, it checks its
cache on a file open call.


If the file is not there, it fetches the file from the server. And if the file has been modified,
then of course, it needs to be written back to the server after the close call. This ensures that
files can be dropped from the client's cache seamlessly; even if a file was modified, it is not an
issue. After it is closed, it is in a sense not modified anymore, because the client copy and the
server copy are the same. So in this case, the file can be dropped from the client's cache.
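
The client-side caching rule (check the cache on open, fetch on a miss, write back on close only if the file was modified) can be sketched as follows. The class name CachingClient, the dictionary that stands in for the server, and the fetch counter are assumptions made for this illustration.

```python
class CachingClient:
    """Toy whole-file cache in the style described for AFS."""

    def __init__(self, server_files):
        self.server_files = server_files  # a dict standing in for the file server
        self.cache = {}                   # file name -> cached contents
        self.dirty = set()                # files modified since they were opened
        self.fetches = 0                  # how many times we had to go to the server

    def open(self, name):
        if name not in self.cache:        # cache miss: fetch the entire file
            self.cache[name] = self.server_files.get(name, "")
            self.fetches += 1
        return self.cache[name]

    def write(self, name, contents):
        self.cache[name] = contents
        self.dirty.add(name)

    def close(self, name):
        if name in self.dirty:            # write back only if the file was modified
            self.server_files[name] = self.cache[name]
            self.dirty.discard(name)

    def invalidate(self, name):
        """Called when the server breaks the callback for this file."""
        self.cache.pop(name, None)
        self.dirty.discard(name)
```

Opening the same file twice costs only one fetch; after invalidate, the next open goes back to the server.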
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Coda Semantics - | 


@ One-Copy Unix Semantics: Modification to any byte in a file
is immediately and permanently visible to every client.

@ AFS-I Semantics: Propagate changes at the granularity of
files (at the time of open and close only).
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@ The client sets up a callback mechanism with the server 
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So, now let us look at the semantics of Coda. So, Unix has a semantics called the one-copy Unix 
semantics, which means that a modification to any byte in a file is immediately and permanently
visible to every client. So, the one-copy semantics in Unix is okay if you are talking of a single
machine; but of course, it is not acceptable for a distributed file system. So, we have
the semantics of AFS-I, which propagates changes at the granularity of files at the time of open
and close, which we just discussed.


AFS-II does something better, which we also discussed, which is that the client sets up a callback 
mechanism with the server; and the server is aware of the files that are cached to the client. So, 


whenever a file changes, the server will notify the client by breaking the callback mechanism. 


And so this mechanism of letting clients know that look your cache files are not valid is called 
breaking the callback. If there is a network partition, of course, then the client cache remains in 
coherent; so that is the server is over here, client is over here. It remains incoherent and nothing 
can be done. So this as far as Coda was concerned, the Coda paper AFS-II was the latest semantics 


that it had seen. 

@ Coda uses a set of servers $S$.
@ A client maintains a subset of servers $s \subseteq S$ that are reachable.
@ Every $\tau$ seconds, a client recomputes $s$.
@ On an open()
  @ A client gets the latest version of a file from $s$.
  @ If ($s = \emptyset$) then it uses its cached version.
  @ Relaxed consistency: It considers a file valid if it was the latest copy at some instant in
    the last $\tau$ seconds, and a callback was lost.
@ On a close()
  @ A client propagates the update to all of $s$.


So, now let us take a look at the Coda semantics. Coda uses a set of servers, capital $S$. A client
maintains a subset of these servers, small $s$, where as you can see $s \subseteq S$. This set $s$
contains the servers that are actually reachable; note that not all the servers that store replicas
may be reachable. Every $\tau$ seconds, the client pings all the servers and recomputes the set $s$.
On an open, the client gets the latest version of the file from $s$.

We will see that there is a way to ensure that a version is the latest version of a file. If
$s = \emptyset$, which means that the client was not able to contact any server, then the client
uses its cached version if it has one. In addition, recall that we discussed that availability and
partition tolerance cannot be achieved together with a strong consistency guarantee. So, consistency
in a certain sense has to be relaxed, and Coda does that. It considers the contents of a file to be
valid if the file was the latest copy at some instant in the last $\tau$ seconds.

Consider a number line where the current time is $t = 0$ and the start of the window is
$t = -\tau$. If the contents of the file were valid at any instant within this interval, it does not
matter that the file's contents have changed after that. If the callback-breaking messages informing
the client that the contents are not valid were lost, we will still consider the copy that the
client has to be a valid copy. So, this is essentially a slightly weaker version of synchrony, where
we allow errors to survive for $\tau$ seconds.

Errors here mean stale copies of files. A file is not considered to be stale if it was modified in
the last $\tau$ seconds and the message informing the client about the modification was lost. Once a
file is closed, the client propagates the update to all of small $s$, so all the reachable servers
are notified about the update.
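
The open and close rules above can be sketched as follows. This is a rough illustration, not Coda's code: versions are assumed to be plain integers (the lecture introduces the real version information later), the reachable set is recomputed on every call rather than every $\tau$ seconds, and the names Replica, reachable, coda_open and coda_close are hypothetical.

```python
class Replica:
    """One server in S; files maps name -> (version, data)."""

    def __init__(self, up=True):
        self.up = up
        self.files = {}


def reachable(servers):
    """Recompute s: the subset of S that currently answers a ping."""
    return [srv for srv in servers if srv.up]


def coda_open(name, servers, cache):
    s = reachable(servers)
    if not s:                                   # s is empty: use the cache
        return cache.get(name)
    copies = [srv.files[name] for srv in s if name in srv.files]
    if not copies:                              # no reachable server has it
        return cache.get(name)
    latest = max(copies, key=lambda c: c[0])    # highest version wins
    cache[name] = latest
    return latest


def coda_close(name, version, data, servers, cache):
    cache[name] = (version, data)
    for srv in reachable(servers):              # propagate to all of s
        srv.files[name] = (version, data)
```

A server that is down during a close keeps its old copy, which is exactly why an open has to ask all of $s$ for the latest version.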

@ Unit of Replication: A volume (a set of files and directories, subtree of the shared file
  system)
@ Each file or directory has a unique ID.
  @ A part of this ID identifies the parent volume.
volume storage group (VSG)

So, now let us discuss replication in Coda. Here we will not use $s$ and capital $S$; instead, we
will give them slightly more technical names. The unit of replication is a volume, which is a set of
files and directories that forms a subtree of the shared file system. A volume can be the entire
file system or a large subtree of it. Each file or directory within a volume has a unique ID. This
is also something that AFS introduced: the moment a file within a volume is created, it is given a
unique ID, which allows us to identify it.

So, it is not its path; it is a unique file-specific ID. Part of this ID indicates the parent
volume, that is, the volume the file is a part of, and the rest is the file ID. For every volume, we
have a set of servers that store replicas of the volume. This set of servers is known as the volume
storage group (VSG), which can be thought of as the set capital $S$ that we were talking about. The
list of servers is stored in a volume replication database, which tells us the contents of the VSG.
And every client has a cache manager, which is called Venus.

This is also the same term that AFS used. The client's cache manager keeps track of the subset of
the VSG that is currently fault free and accessible over the network. We will call this the AVSG:
the available servers in the volume storage group.
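
To make the naming concrete, here is a small sketch of a two-part file ID and of how the AVSG is derived from the VSG. The field names, the database contents and the is_reachable callback are assumptions made for this illustration; they are not Coda's actual data structures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileId:
    volume: int      # the part of the ID that identifies the parent volume
    file_part: int   # the part that identifies the file within the volume


# Hypothetical volume replication database: volume -> VSG (server names).
VOLUME_REPLICATION_DB = {7: {"s1", "s2", "s3"}}


def avsg(fid, is_reachable):
    """Subset of the volume's VSG that the client can currently reach."""
    vsg = VOLUME_REPLICATION_DB[fid.volume]
    return {srv for srv in vsg if is_reachable(srv)}
```

Here the ID, not the path, tells the client which volume a file belongs to and therefore which servers to probe.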

Replication Strategy

@ Upon a cache miss, a client obtains the file from one member of the AVSG (the preferred
  server).
@ The preferred server can be chosen on the basis of physical proximity.
@ The client contacts the other servers in the AVSG to verify that the preferred server has the
  latest copy of the data.
@ If the preferred server is outdated, then the server with the latest copy is made the
  preferred server.
@ Establish a callback with the preferred server.
@ Upon a file close, it is transferred to all the members of the AVSG.


So, what is the replication strategy here? Upon a cache miss, a client obtains the file from one
member of the AVSG. Within the AVSG, one of the members is designated the preferred server. We will
see that this is a standard approach in the world of distributed systems: many proposals distribute
the replicas among a set of servers, and one among them is preferred. The preferred server in this
case can be chosen on the basis of physical proximity or other considerations.

The preferred server is the one that is contacted first. The client then contacts the other servers
in the AVSG to verify that the preferred server has the latest copy of the data. If the preferred
server is outdated, then the server with the latest copy is made the preferred server. Then, similar
to AFS, we establish a callback with the preferred server, which means that if there is a
modification by another client, then all the cached copies are invalidated. Similarly, upon a file
close, the contents of the file are transferred to all the members of the AVSG.
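
The cache-miss path can be sketched like this. It is a simplified illustration: the AVSG is modelled as a dictionary from server name to that server's files, versions are assumed to be integers, at least one server is assumed to hold the file, and fetch_on_miss is a hypothetical name.

```python
def fetch_on_miss(name, avsg, preferred):
    """Return (preferred server, data) for a file after a cache miss.

    avsg maps server name -> {file name: (version, data)}.
    """
    # Ask every AVSG member which version of the file it holds.
    versions = {srv: files[name][0]
                for srv, files in avsg.items() if name in files}
    newest = max(versions, key=versions.get)
    # Switch only if some server is strictly newer than the preferred one.
    if versions[newest] > versions.get(preferred, -1):
        preferred = newest
    return preferred, avsg[preferred][name][1]
```

For example, with a nearby server at version 1 and a distant server at version 2, the distant server becomes the preferred server and its copy is returned.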

Also recall that the AVSG is nothing but a process group, which is something that we studied in the
lecture on virtual synchrony. Some very similar concepts will come up over here in the space of a
distributed file system, and we will see how they are relevant.

Since we have given ourselves a window of $\tau$ seconds, almost all the events in our system have
to be recognized within $\tau$ seconds. One such event is the enlargement of the AVSG, which means
that some server that was not available became available. Recall that we had a similar issue with
virtual synchrony as well.


So, how do I detect that the AVSG has enlarged? Well, we contact missing members every $\tau$
seconds. If the AVSG expands, then it is possible that cached files may be out of date. For example,
it is possible that these servers are alive and the client is fine with them, but the server that
has just joined has a more recent update.

It is also possible that this leads to a conflict. So, for all of these files, Coda is pretty much
going to drop the callback; if the AVSG expands, Coda automatically assumes that these files may be
out of date. This means that if any callbacks have been registered at any server, those callbacks
are dropped. If a file is not currently being modified, then its cached copy is thrown out of the
cache. The next time, the new AVSG, which is all of these servers, is contacted for the latest
update.


One thing that you need to note is that the idea of virtual synchrony fits in very nicely here. The
main idea of virtual synchrony is that anything that happens in a view remains within the view. So,
any message or any action that was happening in the previous AVSG is stopped; once the AVSG expands,
the callbacks are dropped and the cached copies are invalidated. It is not necessary to do all of
this, and some nuances are possible here.

Then, the next time that there is a request from the client, the new AVSG is contacted for the
latest copy. Similarly, it is possible that the AVSG can shrink; this is also detected by probing
each member every $\tau$ seconds. It is also possible that the preferred server dies. If that
happens, then Venus removes the callbacks from the preferred server, in the sense that it makes a
record that the preferred server will not send it callbacks, and it tries to work with the shrunk
AVSG.

Since the AVSG is shrinking, there is really no great fear that an update will be missed. It is of
course possible that another client might send an update to the preferred server, but since we have
gone with the last-write-wins philosophy, that will be taken care of.
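
The handling of AVSG enlargement and shrinkage can be sketched as one probe round. This is a simplification under stated assumptions: callbacks maps each file name to the server that promised the callback, every cached copy is dropped on enlargement (the lecture only drops files that are not currently being modified), and the function name probe is hypothetical.

```python
def probe(vsg, is_up, old_avsg, callbacks, cache):
    """One probe round, meant to run every tau seconds; returns the new AVSG."""
    new_avsg = {srv for srv in vsg if is_up(srv)}
    if new_avsg - old_avsg:
        # Enlargement: a returning server may hold newer data, so drop
        # every callback and cached copy and refetch on the next open.
        callbacks.clear()
        cache.clear()
    else:
        # Shrinkage (or no change): only forget callbacks that were
        # registered at servers which are no longer reachable.
        gone = [name for name, srv in callbacks.items() if srv not in new_avsg]
        for name in gone:
            del callbacks[name]
    return new_avsg
```

Note the asymmetry: shrinking the AVSG leaves the cache alone, whereas any enlargement is treated as a reason to distrust everything that is cached.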

@ Loss of a callback event.
  @ Upon a read, the client verifies the version of the file in the preferred server with that
    of other servers in the AVSG.

So, what happens if there is a loss of a callback event, which means that a callback message got
lost? Upon a read, the client verifies the version of the file in the preferred server with that of
the other servers in the AVSG; we have already seen this. Out of all the servers in the AVSG, one is
the preferred server. The client contacts the preferred server to find out if its own copy is the
latest.

If that is not the case, it takes the updated copy and verifies it with the rest. If there is a
mismatch, then most likely there is a missing callback, and there is a need for some degree of
reconciliation. We will see what the reconciliation is in the next few slides.

Design Details

@ Each remote operation typically needs to contact multiple servers.
  @ Coda provides MultiRPC for this purpose.
  @ MultiRPC uses the multicast capabilities of the network.


Now, let us discuss the communication aspect of the design, that is, how exactly this is done. We
have seen the broad idea of the Coda system: we have a client and a set of servers. One among them
is the preferred server, and the set of servers is known as the AVSG. We need to read a version from
the preferred server and then verify with the rest of the servers that it is the most up-to-date
data. And the writes clearly have to be sent to everybody in the AVSG.

Of course, we will nuance this position later, but at the moment let us assume that a write is sent
to everybody. It can also be sent to a quorum, but I will discuss that when we discuss the Dynamo
paper.

So, Coda uses multicast RPC (MultiRPC) for this purpose instead of unicast messages. It relies on
multicast messages, which most networks natively support. MultiRPC is clearly far more efficient
than a unicast method. This is an advantage that accrues from the properties of the network, and we
will also measure this.

Disconnected Operation

@ Disconnected operation begins when the AVSG is empty.
@ If there is a cache miss in disconnected mode, there is a problem.
@ Venus tries to minimize cache misses by using the LRU replacement policy.
@ Coda also allows the user to specify a priority for files.
  @ High priority files are not removed from the cache.
@ Allows the user to annotate a sequence of actions.
  @ Every file generated as a result of those actions is denoted as sticky.


Let us now discuss disconnected operation, which is one of the main strengths of Coda. Disconnected
operation begins when the AVSG is empty, which basically means that the client is disconnected from
the servers: the link between the client and all the servers in the AVSG is lost. This is a problem.
However, the main aim was that this should not be perceived as a major problem in the Coda
distributed file system. So in this case, the client cache manager Venus tries to minimize cache
misses. How does it do this? The client has a cache, which is of course managed by Venus.

Venus uses a least recently used (LRU) eviction policy to ensure that all the files that were
recently used are kept in the cache, which minimizes the chances of a cache miss. This is a software
cache that resides on the client. Furthermore, Coda allows the user to specify a priority for files,
so that high priority files are not removed from the cache.

In addition, it is possible to annotate a certain sequence of actions such that anything that is
produced as a result of these actions is marked as sticky. For example, we can have a sticky open.
Anything that is opened via these mechanisms gets the highest priority in the cache.
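
An LRU cache with sticky (never evicted) entries can be sketched with Python's OrderedDict. This is only an illustration of the policy described above; the class name VenusCache and its methods are invented here, and a single sticky flag stands in for Coda's user-specified priorities.

```python
from collections import OrderedDict


class VenusCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # name -> data, least recently used first
        self.sticky = set()            # high priority names, never evicted

    def get(self, name):
        if name not in self.entries:
            return None                      # cache miss
        self.entries.move_to_end(name)       # mark as most recently used
        return self.entries[name]

    def put(self, name, data, sticky=False):
        self.entries[name] = data
        self.entries.move_to_end(name)
        if sticky:
            self.sticky.add(name)
        while len(self.entries) > self.capacity:
            # Evict the least recently used entry that is not sticky.
            victim = next((n for n in self.entries if n not in self.sticky), None)
            if victim is None:               # everything left is sticky
                break
            del self.entries[victim]
```

With a capacity of two, inserting a sticky file followed by two ordinary files evicts the older ordinary file, even though the sticky file was used least recently.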


(Refer Slide Time: 28:53) 


Reintegration

@ Happens after disconnected mode ends (one of the servers in the AVSG is up).
@ For each modified file, updates are propagated to the servers in the AVSG.
@ Proceeds top-down from the leaves.
@ There might be conflicts.
  @ Provide a temporary home for storing the client updates (co-volume).
  @ Similar to the lost+found directory in Unix.
  @ Let the client resolve the updates later.


Once the disconnected mode is over, the client is reconnected, which means that one of the servers
in the AVSG is accessible. For each modified file, the updates are propagated to the servers in the
AVSG: the client takes a look at all the modifications and propagates them. For a file, of course,
this is easy; it is treated as normal. For directories it is harder, and we will discuss this
slightly later. Another thing is that there might be conflicts, in the sense that we might be
dealing with concurrent writes.

In this case, the server provides a temporary home for storing the client updates. This is referred
to as a co-volume, which is essentially a space within the server, like a directory, where all the
conflicting files are stored such that they can be reconciled later. This is very similar to the
lost+found directory.


Many of you, if you take a look at your home directories, even on an NFS server, will find a
directory called lost+found. It holds files that had some issue: the file system checker recovered
them but could not automatically put them back in place. In the same way, conflicting files are kept
in the co-volume, which means that the client can resolve all of these updates at a later point in
time.
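
Reintegration with a co-volume can be sketched as a replay of the client's log. The sketch assumes integer versions and a log entry of the form (name, base version, data), where the base version is the version the client had cached when it disconnected; the function name reintegrate and this log format are assumptions, not Coda's actual format.

```python
def reintegrate(client_log, server_files, co_volume):
    """Replay writes that a client made while it was disconnected.

    server_files maps name -> (version, data); co_volume maps name -> data.
    """
    for name, base_version, data in client_log:
        version, _ = server_files.get(name, (0, None))
        if version == base_version:
            # Nobody else wrote the file in the meantime: apply the update.
            server_files[name] = (version + 1, data)
        else:
            # Concurrent write: park the client's copy for later repair.
            co_volume[name] = data
```

Files that changed on the server during the disconnection end up in the co-volume instead of overwriting the server copy, so nothing is lost and the user can reconcile them later.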


(Refer Slide Time: 30:48) 


@ When a user voluntarily disconnects her laptop:
  @ She relies on the large file cache.
  @ She needs to re-synchronize later.


So, when a user voluntarily disconnects her laptop, she essentially relies on the large file cache,
and whatever writes she does have to be resynchronized at a later point in time.

@ When a conflict is detected, Coda tries to resolve it automatically.
  @ Easy to automatically resolve conflicts on directories.
@ There are three kinds of conflicts that cannot be automatically resolved:
  @ update/update conflict: The status of the same object is updated differently in different
    partitions.
  @ remove/update conflict: Updating an object in one partition and removing it in the other.
  @ name/name conflict: Two files with the same name are created.
@ Coda has specialized repair tools that allow the user to fix these conflicts.
  @ The user can see all the replicas.


Now, let us look at the most important aspect of Coda, which is conflict resolution. When a conflict
is detected, Coda will try to automatically resolve it. What we will see is that conflicts in files
are easy to resolve; it is also relatively easy to resolve conflicts in directories. But there are
three kinds of conflicts in directories that cannot be automatically resolved. One is the
update/update conflict. This is where the status of the same object is updated differently in
different partitions. For example, this can refer to the security settings.

In this case, it is possible that the access control of a certain file is changed in a certain
manner on this client and in a different manner on another client. This would be an update/update
conflict in the directory. We might also have a remove/update conflict, where a file is removed in
one partition and updated in the other. There is also the name/name conflict, where two files with
the same name but, of course, different contents are created.

All of these have to be resolved manually at the level of the user. Coda has specialized repair
tools that allow the user to fix these conflicts. The user can see all the replicas that are there
on the different servers, and these can be manually fixed.
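
The three kinds of directory conflict can be told apart by looking at what each partition did to the same name. The sketch below is a deliberately coarse illustration: an operation is a (kind, name) pair, two updates to the same object are always treated as conflicting, and the function name classify is hypothetical.

```python
def classify(op_a, op_b):
    """Classify two operations made in different partitions.

    Each operation is (kind, name), kind in {"create", "update", "remove"}.
    Returns the conflict type, or None if this sketch sees no conflict.
    """
    if op_a[1] != op_b[1]:
        return None                      # different objects: no conflict
    kinds = {op_a[0], op_b[0]}
    if kinds == {"update"}:
        return "update/update"
    if kinds == {"remove", "update"}:
        return "remove/update"
    if kinds == {"create"}:
        return "name/name"
    return None
```

A repair tool would then show the user the replicas involved in each reported conflict, which is the manual step the lecture describes.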

Design Details

Replica Management

@ Each modification has a unique storeid.
@ The server maintains a history of storeids.
@ If the history of storeids on server A is a subset of that on server B, then B contains newer
  copies.
  @ Coda will consider B to have the latest version.
@ This method is useful for files, but can be very conservative for directories.
@ Coda maintains the following information:
  @ Coda maintains the LSID (latest store id), and the current length of the update history.
  @ LSID = client : <monotonically increasing integer>
  @ A replication site also contains the length of the update history of every replica. This is
    like a vector clock. It is called the CVV.


So, now let us look at replica management. Each modification that is made has a unique store id;
this is important. It is like an immutable history. These are very similar concepts to the original
blockchain idea that we have discussed: whenever any modification is made, it is essentially a new
entry, and it is assigned a new store id. This is a very common idea in distributed systems, that
any new update is given its own unique store id.

The server maintains a history of store ids. If the history of store ids on server A is a subset of
that on server B, then you can be sure that B contains newer copies, so Coda will consider B to have
the latest version. B in this case is considered to be dominant, and A is considered to be
submissive. Why submissive? Because it does not have all the updates, and B has all the updates.
This is a useful method for files, but it can be quite conservative for directories, because in a
directory we might be modifying different files without an actual overlap. Coda nevertheless uses
more or less the same method for both.
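
The subset test on store id histories is easy to express with Python sets. In this sketch a store id is assumed to be a (client, counter) tuple, the full history is kept (the lecture explains next that Coda keeps a compact summary instead), and newer_replica is a made-up name.

```python
def newer_replica(history_a, history_b):
    """Compare two store id histories; returns "A", "B", "equal" or "conflict"."""
    a, b = set(history_a), set(history_b)
    if a == b:
        return "equal"
    if a < b:               # A's history is a proper subset of B's
        return "B"          # B is dominant, A is submissive
    if b < a:
        return "A"
    return "conflict"       # each side has updates the other has not seen
```

For instance, if A has seen the updates (c1, 1) and (c1, 2) while B has additionally seen (c2, 1), then B is considered to have the latest version.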


Coda maintains the following kinds of information. First, for every file, it maintains a latest
store id (LSID). You can think of the LSID as a scalar Lamport clock: it is a tuple of a client id
and a monotonically increasing counter, so every subsequent store id is guaranteed to be unique.
Coda also maintains the current length of the update history.

In addition, the replication site contains the length of the update history of every other replica.
This is like a vector clock. If you recall, a vector clock is essentially a vector where each
element is a count, and the ith entry is an estimate of i's local clock. The vector clock, as we
have seen, is very important: it can be used to infer causality, concurrency and all kinds of
things.

This vector clock is called the CVV; it is a version vector. So essentially, with every object, Coda
maintains a latest store id, which identifies the latest version of the object, and also the number
of updates that each replicating site has actually made; this information is in the vector clock.

@ Strong Equality: $LSID_A = LSID_B$ and $CVV_A = CVV_B$
@ Weak Equality: $LSID_A = LSID_B$ and $CVV_A \neq CVV_B$
@ Dominance: $LSID_A \neq LSID_B$ and $\forall i,\ CVV_A[i] \geq CVV_B[i]$
@ Inconsistency: If none of the other three conditions hold.
@ If there is strong or weak equality, the replicas are synchronized.
@ If replica A is dominating replica B, then replica B needs to be updated.




So essentially, we are looking at the tuple of the LSID and the CVV; let me go through this once
again. The LSID is a tuple of the client id and a monotonically increasing integer, so from the
perspective of the client this is a scalar clock. The CVV is a vector clock from the perspective of
the replica servers. It contains the length of the update history of every replica, which tells the
servers how recent their entries are. Now, when we compare the (LSID, CVV) tuples of two replicas A
and B of a given file, we can have one of these four conditions.

We can have strong equality, which means that between the replicas A and B, the latest store ids are
the same and the histories are also the same. This basically means that both replicas are on the
same page: the last update that they have processed is the same, and they also have a common view of
the rest of the replicas. Then we can have weak equality. In weak equality, the latest update that
they have processed is the same, which means $LSID_A = LSID_B$.


However, and this part is important, their views of the world, the CVV vectors, are not the same;
this is weak equality. They might have missed some updates in the past, the replicas might not have
been updated, or some messages might have been lost. That is why the CVVs diverge. But the fact is
that they still have the latest copy, because the LSIDs are the same. Then we can have dominance. In
dominance, the LSIDs are not the same, but $\forall i,\ CVV_A[i] \geq CVV_B[i]$.

This is a typical vector clock condition, which says that A is more recent. A clearly has more
up-to-date information than B, because every single entry of its CVV is greater than or equal to the
corresponding entry of B's CVV; in this case A is dominating and B is submissive. Inconsistency is
when none of these three conditions hold; this is where some kind of resolution is required, so the
resolution subsystem has to be invoked. If there is strong or weak equality, we can say that the
replicas, or rather both the servers, are synchronized.

Both replicas are fully synchronized, and they have the most up-to-date data. With weak equality,
one of the servers does not know enough about all the replicas, but that can be fixed. If replica A
is dominating replica B, then replica B clearly has old information, which needs to be updated.
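
The four conditions translate almost directly into code. In this sketch an LSID is assumed to be a (client id, counter) tuple and a CVV a list of integers of equal length on both sides; the function name compare is hypothetical, and the "B dominates A" branch is simply the symmetric case of the dominance condition stated above.

```python
def compare(lsid_a, cvv_a, lsid_b, cvv_b):
    """Compare the (LSID, CVV) pairs of two replicas A and B."""
    if lsid_a == lsid_b:
        # Same latest update; the CVVs decide between strong and weak.
        return "strong equality" if cvv_a == cvv_b else "weak equality"
    if all(x >= y for x, y in zip(cvv_a, cvv_b)):
        return "A dominates B"           # B is stale and must be updated
    if all(y >= x for x, y in zip(cvv_a, cvv_b)):
        return "B dominates A"           # A is stale and must be updated
    return "inconsistency"               # resolution subsystem is needed
```

As a worked case, CVVs [3, 1, 2] and [2, 2, 2] with different LSIDs are inconsistent: A is ahead in the first entry and B is ahead in the second, so neither dominates.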

Outline

@ Design Details
  @ State Transformation

State Transformation - Update

@ Update:
  @ Most common operation: file create, delete, modification of permissions
@ First Phase
  @ The client sends the LSID and CVV to each AVSG server.
  @ If there are no conflicts, the server performs the desired action.
@ Second Phase
  @ Each AVSG site records the client's view of which AVSG sites performed the update
    successfully.


Now, let us take a look at the details of the protocol and how the state is transformed via the
state transformation logic. What are the most common operations in a file system? File create,
delete and modification of permissions. A typical Coda file operation is broken into two phases. In
the first phase, the client sends the latest store id and whatever CVV it knows to each AVSG server.
You do not generally expect conflicts, but you can have them if, let us say, there was a
disconnection and another client updated the value.

This operation is very much like two-phase commit: if there are no conflicts, the server performs
the desired action and keeps a temporary record.

The client goes from the first phase to the second phase if there are no objections, that is, no
conflicts. As I said, the analogy with two-phase commit is very apt in this case. In the second
phase, each AVSG site records the client's view of which AVSG sites performed the update
successfully, and the vector clocks, the CVVs, are updated accordingly. So, let us take a deeper
look into the first and second phases.

Design Details

Check at an AVSG Server

@ The check succeeds for files if:
  @ The cached and server copies are the same.
  @ Or, the cached copy dominates.
@ The check succeeds for directories:
  @ When the two copies are equal.
@ If the check does not succeed:
  @ The client pauses the operation, and invokes the resolution subsystem.
  @ If the resolution subsystem can automatically fix the problem, then the client restarts.
  @ Otherwise, an error is returned to the client and the operation is aborted.
@ If the operation is successful, the server performs the action, notes the LSID of the client,
  and commits a temporary CVV.


So, let us look at the checks that are performed at an AVSG server; this is for the first phase. The
check succeeds for files if the cached and server copies are the same, that is, they have the same
LSID, or if the cached copy is dominating. Otherwise it fails. The check succeeds for directories
only when the two copies are equal. We have been arguing that this is conservative, but Coda does
follow it, because otherwise we would need to maintain a lot of state.

Now, let us see what happens if the check does not succeed. In this case, the client pauses the
operation and invokes the resolution subsystem. A failed check tells the client that it pretty much
has a missing update or a stale copy. The resolution subsystem tries to automatically fix the
problem, which means that it fetches those updates that the client did not have and tries to perform
a merge. If the merge is not successful, then an error is returned to the client and the operation
is aborted; it is stopped over there.


However, if the client has an up to date copy, then what happens is that all the servers perform the 
action; and the note the LSID of the client and commit a temporary CVV. So, it is important to 
understand that if the multiple servers and the client is over here. So, in phase one what happens 
is that the client sends the version to all the servers. And the servers then indicate if they are okay 
or not to go forward, or go ahead with update; and they also note the temporary CVV that the client 
is proposing. So there is the CVV, the latest CVV of the data that the client has; so this is phase 


one. 


In phase two, after the client has received the responses from the entire AVSG, and if there is no
error and no conflict, the client sends a commit message informing the servers that they are clear
to perform the update at their sites. Each server is also made aware of the rest of the servers
that have agreed to perform the update.


Each server then increments its own entry in the vector clock as well as the entries corresponding
to the rest of the servers. This ensures that by the end of phase two, the entire AVSG has the
same CVV for the data, that is, the same vector clock.
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Design Details 


Transformation 


State Transformation - Update 


@ Update:
@ Most common operation - file create, delete, modification of
permissions
@ First Phase
@ The client sends the LSID and CVV to each AVSG server
@ If there are no conflicts, the server performs the desired action
@ Second Phase
@ Each AVSG site records the client's view of which AVSG sites
performed the update successfully.
And why is this the case? It is because in phase two, as you can see over here, each AVSG site
records the client's view of which AVSG sites performed the update successfully. So, at least all
the AVSG sites that performed the update successfully are guaranteed to have the same value of the
vector clock, which is the CVV in this case. Here is a further explanation of this.


(Refer Slide Time: 45:52) 


@ At the end of phase 1, the client examines the replies from
each server.
@ For each responding server i, it augments CVV[i].
@ The client sends this CVV to every responding server
@ Each responding server replaces its tentative CVV by this
CVV.
@ Venus returns control to the user at the end of the first
phase.


So, for each responding server i, the client augments CVV[i], and it sends this CVV to all the
responding servers, as we just discussed. In phase two, the tentative CVV is replaced by the final
CVV. The aim is that by the end of phase two, the AVSG has the updated data as well as the updated
CVV for the data, and all the replicas have the same CVV.
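So, the two phases can be put together in a small Python sketch. This is a toy model under stated assumptions, not the Coda code: the `Server` class, the `update` function, and the simplified conflict check are all hypothetical, every server in the AVSG is assumed to respond, and resolution and failure handling are left out.

```python
class Server:
    """One AVSG server holding a single object (hypothetical model)."""

    def __init__(self, index, n):
        self.index = index
        self.cvv = [0] * n      # committed version vector
        self.tentative = None   # CVV noted in phase one
        self.data = None

    def phase_one(self, client_cvv, new_data):
        # Simplified check: reject if the client's copy is behind in any entry.
        if any(c < s for c, s in zip(client_cvv, self.cvv)):
            return False
        self.data = new_data                # perform the action
        self.tentative = list(client_cvv)   # note the tentative CVV
        return True

    def phase_two(self, final_cvv):
        self.cvv = list(final_cvv)          # tentative CVV replaced by the final one
        self.tentative = None


def update(client_cvv, servers, new_data):
    # Phase one: send the version to every server and collect the replies.
    replies = [(s, s.phase_one(client_cvv, new_data)) for s in servers]
    if not all(ok for _, ok in replies):
        raise RuntimeError("conflict: the resolution subsystem would be invoked here")
    # The client augments CVV[i] for each responding server i.
    final_cvv = list(client_cvv)
    for s, _ in replies:
        final_cvv[s.index] += 1
    # Phase two: every responding server adopts the same final CVV.
    for s, _ in replies:
        s.phase_two(final_cvv)
    return final_cvv


servers = [Server(i, 3) for i in range(3)]
first = update([0, 0, 0], servers, "v1")
second = update(first, servers, "v2")
```

After each call to `update`, all three servers hold the same data and the same CVV, which is the property that phase two is meant to guarantee.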
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Design Details 
Transformation


State Transformation - Force


@ Force operation - Transfer of file contents from a dominant
to a submissive site.
@ Force of a directory is more complex.
@ Lock and atomically apply changes one directory at a time
@ Before creating a new entry, we first create a stub at the
server. It contains a CVV that will always make it submissive
@ Subsequently, a force operation will change the status of the
stub


So, if there is a conflict, then Venus, the client-side file system manager, does a force
operation, where it attempts a merge; specifically, it transfers file contents from a dominant
site to a submissive site. The submissive site can be either the client or a server; it does not
matter, since we have the full version history. Whoever does not have the updates is the
submissive site, and the updates need to be transferred to it. However, doing a force operation
for a directory is far more complex. We need to lock and atomically apply changes one directory at
a time, so an entire directory has to be locked while this is done.


So, this is a hard operation, because a directory tree can be really deep and the force has to be
done recursively. That is the reason we do it one directory at a time; we do not do the entire
tree at once. Now, let us say a new entry has to be created. We first create a stub at the server,
which is guaranteed to be submissive because its CVV can contain all zeros. Subsequently, a force
operation will change the status of the stub, which means it gives the stub all the updates. So
why is it done this way? This is very interesting.


So, say we have a client and a set of servers, and we are trying to create a new file in a
directory. We first create a stub in the directory for the file, which is an entry with that name
but with no contents, and which is guaranteed to be submissive to every other update. Consider the
creation of the stub as phase one. In phase two, if the client has a dominant copy, all the
updates are transferred to the stub. A very similar process is followed between one server and the
rest of the servers.


An astute viewer can ask: why not do both of them in one go? That would imply that the entire
contents are transferred to the servers at the same time as the entry is created. The problem is
that if there is any node or network failure, then we might have the file in an inconsistent
state, where some of the servers have the update and some do not. It is possible that, say, two of
the servers have the contents of the file and the third server does not.


So, to prevent that, we first just create a stub entry, which at least indicates to each server
that a file with this name exists. Creating the stub is very quick; it needs only short messages
that can be sent in a single RPC. Once all the servers have the stub, the file contents can be
transferred later, in the second stage. Even if this transfer gets interrupted, at a later point
of time a server that has all the updates will see a submissive entry at the other server and
transfer the updates to it. So, this increases the reliability and the robustness of the process.
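So, here is a minimal Python sketch of the stub idea. It is a toy model under stated assumptions: the `Replica` class, the `force` function, and the file name are hypothetical, a file is modeled as a (CVV, contents) pair, and the interrupted transfer is simulated simply by not calling `force` for one of the servers.

```python
def dominates(a, b):
    """True if version vector a is >= b in every entry and > b in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


class Replica:
    def __init__(self, n):
        self.n = n
        self.files = {}   # name -> (cvv, contents)

    def create_stub(self, name):
        # An all-zero CVV makes the stub submissive to every real version.
        if name not in self.files:
            self.files[name] = ([0] * self.n, None)


def force(src, dst, name):
    """Copy the file from src to dst only if the copy at src dominates."""
    src_cvv, contents = src.files[name]
    dst_cvv, _ = dst.files[name]
    if dominates(src_cvv, dst_cvv):
        dst.files[name] = (list(src_cvv), contents)
        return True
    return False


servers = [Replica(3) for _ in range(3)]
for s in servers:                       # stage one: short, quick messages
    s.create_stub("notes.txt")
servers[0].files["notes.txt"] = ([1, 0, 0], "hello")
force(servers[0], servers[1], "notes.txt")
# The transfer to servers[2] is "interrupted": it still holds only the stub.
pending = [s for s in servers if s.files["notes.txt"][1] is None]
# Later, a server with the updates sees the submissive stub and repairs it.
force(servers[0], servers[2], "notes.txt")
```

The point of the sketch is that the stub never blocks anything: any server with a real version dominates it, so the repair can happen at any later time.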
State Transformation


State Transformation - Repair and Migrate 


@ A repair operation is used to fix inconsistent updates 


@ If we detect inconsistent updates, then the file is marked as 
inconsistent and moved to a covolume. 


@ All accesses to inconsistent objects fail 


So, we also have repair and migrate operations. A repair operation is used to fix inconsistent
updates. As we said, whenever an inconsistency is detected, the file is marked as inconsistent and
moved to a covolume, where it is stored for the user to take a look manually. All accesses to
inconsistent objects fail, so we basically wait until the conflicts are actually resolved.
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Implementation 


@ Implemented on IBM workstations. 
@ 12 MB main memory, 70 MB Hard Disks 
@ Each server had 400 MB disks 
@ Uses the Camelot transaction facility for single site transactions.
@ Uses the Andrew file system benchmark 
@ 70 files - 200 KB each 


So, coming to the implementation, this was implemented on IBM workstations. At that time, 12
megabytes of main memory was a lot; of course nowadays it is nothing, and even smartphones have
much more memory than that. But in those days, 12 megabytes of main memory and a 70 MB hard disk
was all that a workstation had.


The servers had 400 MB disks. The Andrew file system benchmark was used, which essentially
consists of 70 files of 200 KB each, along with a mix of operations: read, write, read directory,
scan directories. For single-site transactions, the Camelot transaction facility was used to
ensure that an operation on a single site happens atomically.
Cost with Replication


Configuration   | Time Overhead
No Replication  | 21%
1 Extra Server  | 22%
2 Extra Servers | 26%
3 Extra Servers | 27%


So, four configurations were studied. The first is no replication, which means there are no
additional servers; this is the baseline. Here, the time overhead over AFS is 21%. The nice thing
about Coda, which was really appreciated at that time and is why this paper remains a classic, is
that the additional overhead with extra servers is minimal. The overhead rises only to 22%, 26%,
and 27% with one, two, and three extra servers respectively.


The main reason is that we use an RPC-based mechanism, updates can be lazy, and we maintain
versions. So, the critical path, in a certain sense, is not actually lengthened. And since many of
these operations are not really network dominated, this does not turn out to be an issue.
Evaluation


Benchmark Time vs Load


@ For AFS the elapsed time remains roughly constant at 400
seconds (1 to 10 load units).

@ For Coda the time increases from 400s to 650s roughly
quadratically for 1 to 10 load units. A load unit represents
the requests of 5 typical AFS users.


So, here we define a load unit as the load generated by five typical users. For AFS, the total
time for running the benchmark remained roughly constant at 400 seconds across the different load
units. For Coda, as we go from 1 to 10 load units, the time increases from 400 to 650 seconds, and
it increases roughly quadratically. So, it is not a linear increase; it is a quadratic increase.


So, as we increase the load, the time for Coda increases superlinearly. The main reason is that
Coda puts a higher load on the network synchronization, server management, version management,
migration, etc. Nevertheless, this is a very scalable file system, and this is why this paper has
remained a classic.
Unicast: The network load in terms of packets increases
linearly from 5,000 to 60,000 while varying the load
units from 1 to 10.


Multicast: For the same range of load units the network load
increases linearly from 5,000 to 40,000.


Then, let us discuss the effect of unicast and multicast. If unicast is used in an iterative
fashion, the network load in terms of packets increases linearly from 5,000 to 60,000 across load
units. With multicast, of course, the number of packets reduces; it goes up only to 40,000. I
would advise viewers to take a look at the paper, which also talks about latencies and so on. In
general, while designing a distributed system, it is a good idea to use the multicast and
broadcast capabilities of the network.
Coda: A Highly Available File System for a Distributed
Workstation Environment by Mahadev Satyanarayanan, James
Kistler, Puneet Kumar, Maria E. Okasaki, Ellen H. Siegel,
and David C. Steere, IEEE Transactions on Computers,
1990
So, this was the reference: IEEE Transactions on Computers, 1990, a very old paper. But this paper
has inspired successive generations of distributed file systems, and many of today's file systems
owe much of their origins to the Coda paper.
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Smruti R. Sarangi 


Department of Computer Science 
Indian Institute of Technology 
New Delhi, India 


Smruti R. Sarangi Dynamo 


In this lecture, we will discuss the Dynamo system, which powers many of the services within
Amazon. It has been documented very well in their paper. Dynamo is essentially a key-value store;
it is a DHT.
Outline


@ Motivation
@ System Architecture
@ Evaluation




So, we will have three main sections in these slides: the motivation, the system architecture, and
the evaluation.
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Prerequisites 


@ Distributed Hash Tables: Chord and Pastry 
@ ACID Guarantees 
@ Eventual Consistency 




So, we have a few prerequisites before viewing this lecture or reading the slides. First, the
viewer should have an idea of distributed hash tables, in particular, the Chord and Pastry
algorithms. Without knowing these algorithms and without understanding how a distributed hash
table works, it will be very difficult to follow the concepts in this paper.
Second, the viewer has to be aware of the ACID guarantees. ACID stands for atomicity, consistency,
isolation, and durability. These four concepts from the world of databases should be known to the
viewer. And finally, the viewer should have some idea of consistency models in distributed
systems, notably eventual consistency. Eventual consistency refers to a model where a write is
ultimately visible; all writes are ultimately visible. So, these three concepts should be known
before proceeding.
So, the background is that reliability was one of the biggest challenges for a large e-tailer like
Amazon. In particular, Amazon aims at 99.999%; as you can see, there are five nines over here, two
before the decimal point and three after it. So, Amazon aims for very high reliability. There are
many reasons for it.


The first is that even a small downtime, particularly in the peak season, which is Christmas in
the U.S. and Diwali in India, will drive away users. This will lead to a large loss in net
revenue. Additionally, the e-commerce site is not the only site that Amazon runs. Amazon does many
more things. It has sites for web services, it has a lot of sites for the vendors who keep
uploading the details of their products, and it has internal sites for payments to the vendors,
for ads, and for a lot of other things.
So, given that Amazon is involved in so many things, there is a need for very high reliability.
However, the infrastructure consists of thousands of commercial off-the-shelf servers, and the
server components and network components keep failing. The customers basically need an always-on
experience, which means that the platform that they are using should be always on, always active,
and always responsive.
So, for this purpose, Amazon needed a highly available key-value store. Of course, we have seen
Pastry and Chord and the reliability guarantees that come along with them. They are great, but not
great enough to be used in a commercial system of the size of Amazon. So, what are some of the
services that Amazon would like to provide? Well, one of them is a best sellers list.


For example, I was just doing this right now. At the time this video is being recorded, we are in
a lockdown in India. There was some news that e-commerce might open today, which it has not, but I
was at least trying to see what the new best sellers are in the world of electronic gadgets.
Thankfully, many sites like Amazon have a best sellers list that I can check.


Then, of course, there are shopping carts, which are very important. These are digital shopping
carts; whatever I want to buy, I add to a shopping cart. Then we can have customer preferences,
which are based on things that were bought before. You would also have seen that Amazon suggests a
lot of products that I can buy based on my purchases, that is, based on things that I have bought
before.


Then there is the sales rank, which is the ranking of products; if, let us say, I want to buy a
DVD player, it will show me the rank of the products. And of course there is the product catalog,
which is something that many of the vendors see when they are uploading products. So, as I said,
across users and across vendors, there are a lot of services that Amazon typically provides, and
these services keep on increasing. This is an old paper, published in 2007, and after that Amazon
has grown many, many times.


Even at that time, Amazon was already very popular. It served 3 million shopping checkouts in a
single day, and it was managing thousands of sessions. The way it did this was by using a DHT. If
you recall from our discussion on Pastry and Chord, the DHT is a structure that can expand. At a
time of peak demand, say at Christmas, we can bring in an additional data center, the DHT expands,
and the keys get redistributed. So, the DHT can grow and shrink based on the demand, which is one
of its key advantages.
So, here the assumptions and requirements are that it is a large system with simple semantics. We
will provide simple key-value access. As for the ACID properties, in a large system, A, I, and D,
which are atomicity, isolation, and durability, have to be provided. Amazon does relax the
consistency requirement; we will see how.


Furthermore, there has to be some service level agreement (SLA). Here it is important to mention
that Dynamo, from the point of view of Amazon, is an internal service, not an external one, which
means that other services within Amazon use what Dynamo provides. A service level agreement is
required not only between Amazon and third parties but also between one branch of Amazon and
another.


A typical service level agreement talks about, let us say, the mean latency, or maybe the maximum.
Amazon went for something different, and very interesting reasons are given in the paper. They
look at the 99.9th percentile, which to them is what matters: 99.9% of all accesses, that is, all
but one in a thousand, have to satisfy the SLA.


For example, suppose the SLA prescribes a certain maximum latency. Then 999 out of 1000 requests
need to satisfy that. And if there is a maximum client request rate, then the SLA needs to be
satisfied at that rate with the same 99.9% criterion.
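So, to see why the 99.9th percentile says more than the mean, here is a small Python sketch. It is only an illustration: the function names, the latency samples, and the 300 ms limit are assumptions made up for the example, and the percentile is computed with the simple nearest-rank rule using integer arithmetic.

```python
def percentile_999(latencies_ms):
    """Nearest-rank 99.9th percentile: the value below which 999 of 1000 samples fall."""
    ordered = sorted(latencies_ms)
    rank = -(-999 * len(ordered) // 1000)   # ceil(0.999 * n) with integers only
    return ordered[max(rank, 1) - 1]


def meets_sla(latencies_ms, limit_ms):
    return percentile_999(latencies_ms) <= limit_ms


def mean(latencies_ms):
    return sum(latencies_ms) / len(latencies_ms)


good = [10] * 999 + [5000]       # one slow request in a thousand
bad = [10] * 998 + [5000] * 2    # two slow requests in a thousand

good_ok = meets_sla(good, 300)   # hypothetical 300 ms limit
bad_ok = meets_sla(bad, 300)
```

In both samples the mean latency stays far below the limit, so a mean-based SLA would not notice the difference, whereas the 99.9th percentile check accepts the first sample and rejects the second.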
So, producing a single Amazon page, which we see when we search for something, is actually a very
complicated process. Amazon gets the client requests, and they go to a multitude of servers. A
client request can be anything. It can be a search query, where the client wants to see the list
of products that Amazon has, or the client might just want to update his or her shopping cart.


There is a set of servers called aggregators. They collect and collate the requests and route them
to multiple Dynamo instances. Each of these Dynamo instances manages a different thing. One
instance might manage the shopping cart, another might manage the list of best sellers, and
another might work as a recommender system. The different instances give back their results, and
the aggregator aggregates them.


If we consider a typical search page, Amazon might have a list of search results over here. It
might have a part of the page that shows the best sellers in this category, and another part that
shows what customers also bought, namely similar items. And then, of course, here there will be
the details of the shopping cart.


So, each of these pieces of information is produced by a different Dynamo instance. The
information is subsequently aggregated by the aggregator servers, and then a web page is created.
The web page comes back to the client, and the client sees it in the browser.
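So, the role of an aggregator can be sketched as a simple fan-out and collect step in Python. This is a toy sketch under stated assumptions: the three lookup functions stand in for separate Dynamo instances, and their names and return values are invented for the example; only `ThreadPoolExecutor` from the standard library is real.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for three separate Dynamo instances.
def shopping_cart(query):
    return ["pen"]


def best_sellers(query):
    return [query + " model A", query + " model B"]


def also_bought(query):
    return [query + " case"]


def render_page(query, services):
    """Send the request to every service in parallel and collate the results."""
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, query) for name, fn in services.items()}
        return {name: future.result() for name, future in futures.items()}


page = render_page("phone", {
    "cart": shopping_cart,
    "best_sellers": best_sellers,
    "also_bought": also_bought,
})
```

Each section of the resulting page comes from a different service, and the aggregator only waits for the replies and puts them together.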


So, it is important to understand that whenever we click a button on Amazon, we see a web page,
and this web page is actually produced by a very complicated process, where a lot of information
is collected from different sources within Amazon. Everything is brought into the same web page
and shown to the client.
So, what are some key broad principles that we would like to have? The first is incremental
scalability, which means that it should be possible to increase the size of the DHT one node at a
time. The key property of a DHT is that it can grow and shrink. This is a very good property
because at a time of high load, we can take cloud computing resources and add them to the DHT, and
at a time of low load, these resources can be released.


Amazon can then, for example, offer the released compute power to individual users through
something like Amazon Web Services, who can use it to do their assignments, run projects, and so
on. So, a DHT gives us a certain amount of elasticity, which is a very important property of DHTs.
Then, of course, there is symmetry, which means that all the nodes do the same thing. We have
already seen in Pastry and Chord that this is the case. This is indeed a peer-to-peer system. So,
this is the typical DHT decentralization.


Then there is heterogeneity, which means that the heterogeneous capabilities of servers also need
to be exploited. One thing to understand is that whenever such DHTs are created, which are very
large and can consist of tens of thousands of nodes distributed across many data centers in the
world, all the nodes will not have the same compute capability. Some might have older processors,
some might have newer processors, some might be really small, some might be big; CPUs, GPUs,
FPGAs, all kinds of nodes will be there on the DHT.
So, expecting all of them to execute at the same rate with the same service level guarantees is
not a wise idea. We should be aware that heterogeneity exists. And finally, there is eventual
consistency, which means two things in this case. Recall that in a DHT we have two operations, put
and get. The first thing is that a get operation returns a set of versions of the same variable.


A put, of course, takes a key and a value. But given that we have eventual consistency, it is
possible that different nodes see different values because of the weak consistency, which means
that there will be multiple values for the same key, each with its own version. So, the get
operation will fetch all of those versions.


If you recall, we were doing something very similar when we discussed distributed file systems
like Coda. In that case also we were fetching the different disparate versions, and then there was
a merge. So, I would request the viewers to take a look at the lecture on the Coda distributed
file system, which is a part of the same playlist. It discusses the versions issue in more detail.


The second thing, as we have discussed, is that a write ultimately succeeds. It does not take
infinite time for a write to happen; it ultimately happens. But then, of course, multiple versions
have to be created, and we are explicitly aware of that.
So, as I just mentioned, there are two key operations, get and put. A get returns all the values
associated with a key along with some additional information, which is referred to as the context.
Similarly, a put inserts a key and value with a given context into the database. The context is
actually a set of vector clocks; this will become clearer as we keep on discussing.
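So, to make the get and put interface a little more concrete, here is a toy, single-process Python sketch. It rests on several assumptions: the class `ToyStore`, the node names "A" and "B", and the representation of a vector clock as a dictionary are all hypothetical, and this is not Dynamo's actual interface or implementation. The idea it illustrates is only that a get returns every concurrent version together with a context, and that a put made with that context supersedes the versions it has seen.

```python
def descends(a, b):
    """True if vector clock a has seen everything that vector clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())


class ToyStore:
    def __init__(self):
        self.versions = {}   # key -> list of (vector clock, value)

    def get(self, key):
        found = self.versions.get(key, [])
        values = [value for _, value in found]
        context = [dict(clock) for clock, _ in found]   # a set of vector clocks
        return values, context

    def put(self, key, context, value, node):
        # The new clock merges everything in the context, then ticks this node.
        clock = {}
        for seen in context:
            for n, count in seen.items():
                clock[n] = max(clock.get(n, 0), count)
        clock[node] = clock.get(node, 0) + 1
        # Keep only the stored versions that the new write has not seen.
        kept = [(c, v) for c, v in self.versions.get(key, []) if not descends(clock, c)]
        self.versions[key] = kept + [(clock, value)]


store = ToyStore()
store.put("cart", [], ["book"], "A")
_, ctx = store.get("cart")
# Two writers start from the same context and write through different nodes.
store.put("cart", ctx, ["book", "pen"], "A")
store.put("cart", ctx, ["book", "lamp"], "B")
values, ctx2 = store.get("cart")               # both versions come back
merged = sorted({item for v in values for item in v})
store.put("cart", ctx2, merged, "A")           # the reader reconciles them
final_values, _ = store.get("cart")
```

Neither concurrent write is lost: the get returns both carts, and the merged cart written with the combined context replaces them.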


The important conceptual point over here is that writes need to be very fast. This is a design
decision, and we will see why. Reads, on the other hand, can be slow. The other point is that we
never lose a write. So, in the priority order, we have significantly increased the priority of
writes, and we have deprioritized reads. Most e-commerce sites, in fact most customer-centric
sites, do this, and the reason is rather clear.


Consider a shopping cart. We want to get the customer's clicks and preferences as quickly as
possible and just record them. Let us say I have a browser open over here with multiple tabs, tab
one, tab two, and so on, and Amazon is open in all of those tabs. I click something in this tab,
then I click something in that tab, and then in a third tab.


So, I am basically adding things to the shopping cart, but the additions are happening via
different tabs. We never want to lose or overwrite any of my preferences. I might close the
browser in the middle, and I might then open the Amazon site on my mobile phone. Whatever I do, I
never want to lose any write, which means I never want something that I added to the shopping cart
to go missing or my preferences to be forgotten.


The other point is that we always want to record a write. So, synchronizing the writes is not on
the critical path; we just want to record them quickly. And when we do a read, we are well aware
that there are many different versions floating around, because we quickly recorded the writes
without synchronizing. At that point, version management and enforcing some synchronization and
consistency across all the writes that we have gathered becomes important. That is why, in this
case, reads are slow and writes are fast.
So, how does this work? Dynamo uses consistent hashing, similar to Chord, to distribute keys over
a circular space. As in Chord, each item is assigned to its clockwise successor on the ring.
Furthermore, what Chord did was that it kept multiple virtual nodes on each physical node. In
other words, it created a host of positions on the ring, and each physical node was given a set of
contiguous virtual nodes for the sake of load balancing.


We will see that virtual nodes are used in a conceptually similar manner over here, but not in
exactly the same way. We just need to keep in mind that a physical node is responsible for
multiple virtual nodes, which means multiple positions on the ring. They were contiguous in Chord,
but they need not be contiguous over here, as we will see.


For fault tolerance, the key is assigned to N successors, which are called the preference list.
Say this is a part of the ring, these are all our successor nodes, and our key maps over here. We
assign the key to N successors, also known as the preference list. So, there is a degree of
redundancy in our system: N successors all have the key and the value, such that if any one of
them is down, the other ones can supply the data.


Furthermore, Dynamo tries to distribute them across different physical locations, where these
physical locations are in different data centers. If you look at the key space, of which I am just
drawing a part, it is quite possible that two adjacent nodes are in the same data center. In that
case, Dynamo skips the second position; the first node remains the first successor, and the next
node that is in a different data center is taken as the second successor.


So, when we talk of N successors, they are located on different physical nodes, which are
preferably in different data centers. N successors should not be interpreted as the N immediate
successors on the ring; rather, they are the closest successors on the ring that satisfy this
condition. If any two of them are in the same data center, we skip the second one, then possibly
the third, and so on, so as to ensure that all N of them are physically located at different
places. This is what gives us the fault tolerance.
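So, the construction of the preference list can be sketched as a clockwise walk that skips repeated data centers. This is a minimal Python sketch under stated assumptions: the ring is modeled as a sorted list of (position, node, data center) tuples, the node and data center names are invented, and the function `preference_list` is hypothetical rather than Dynamo's actual routine.

```python
import bisect


def preference_list(ring, key_pos, n):
    """Walk clockwise from the key's position and pick up to n nodes,
    skipping any node whose data center has already been chosen."""
    positions = [pos for pos, _, _ in ring]
    start = bisect.bisect_left(positions, key_pos) % len(ring)   # clockwise successor
    chosen, seen_dcs = [], set()
    for step in range(len(ring)):
        _, node, dc = ring[(start + step) % len(ring)]
        if dc not in seen_dcs:
            chosen.append(node)
            seen_dcs.add(dc)
        if len(chosen) == n:
            break
    return chosen


# Hypothetical ring, sorted by position.
ring = [
    (10, "n1", "dc-east"),
    (20, "n2", "dc-east"),
    (30, "n3", "dc-west"),
    (40, "n4", "dc-south"),
]

replicas = preference_list(ring, key_pos=5, n=3)
wrapped = preference_list(ring, key_pos=35, n=2)
```

For the key at position 5, node n2 is skipped because it shares a data center with n1, so the list is n1, n3, n4; the first entry can be treated as the closest successor on the ring.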


One among these N successors is the coordinator node, which is primarily responsible for the key
and value. So, this is where the key maps, this is the preference list, and one node in the list,
which might be the closest one, is the coordinator. The rest just contain replicas of the
key-value data.
So, what is the idea? Well, the consistent hashing algorithm in Chord still has some issues with
regard to non-uniformity of data. Why is this? The first reason is that even though you are using
a very good hashing algorithm, the ring will have regions that are more populated with keys and
regions that are less populated. The second reason is that the nodes themselves have heterogeneous
performance; they are not of the same type.


Some are new CPUs, which are fast. Some are old CPUs. We can have ultra-fast FPGAs with very high
throughput. So, of course, latency might be low, but throughput will be very high. We can have
other kinds of futuristic hardware, custom accelerators, and so on. So, what we actually do is
that each node is assigned to multiple positions on the ring. These positions are also called
tokens. The range of keys that a position is responsible for, which means the range that it is the
successor of, is assigned to a virtual node within the physical node.
I have a small figure over here that explains the same concept. We have a ring over which all the
hash values are distributed. Given the physical node P1, as we have just described, we assign it
to multiple positions, also called tokens, on the ring. This means that P1 is assigned different
virtual nodes: V1, V2, V3, and V4.


Let us now assume that for V3, the virtual node closest to it on the ring is V1', which
corresponds to a physical node P2. Then the region of the hash space that you see over here,
between the two, is the portion of the key space that V3 is responsible for. So, what we have
essentially done is that we have taken the key space, which is a very large circle, and assigned
multiple positions, or tokens, in it to each physical node such as P1; each of these positions is
a virtual node.


Furthermore, the key space has been implicitly partitioned in a Chord-like manner, because for any key that maps to this region over here, the logical successor is V3. And since we store replicas of the key, they will be stored in the N nodes that succeed the key, starting with V3, subject to the condition that these replicas are physically stored in different places.


And so, this gives us a degree of replication. Also, the fact that we are mapping each physical node to random positions on the ring gives us a degree of uniformity and load balancing. So, this is the key difference between Dynamo and Chord.
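
The token idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Dynamo's actual code: the `Ring` class, the `ring_hash` helper, the use of MD5 reduced to a 32-bit space, and the `#token` naming scheme are all hypothetical choices made for this sketch.

```python
import bisect
import hashlib


def ring_hash(value: str) -> int:
    """Map a string to a position on the ring (assumed here: a 32-bit space)."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2 ** 32)


class Ring:
    def __init__(self, tokens_per_node: int = 4):
        self.tokens_per_node = tokens_per_node
        self.tokens = []  # sorted list of (position, physical_node)

    def add_node(self, physical_node: str) -> None:
        # Each physical node gets several positions (tokens) on the ring;
        # every position plays the role of one virtual node.
        for i in range(self.tokens_per_node):
            position = ring_hash(f"{physical_node}#token{i}")
            bisect.insort(self.tokens, (position, physical_node))

    def preference_list(self, key: str, n: int) -> list:
        """Walk clockwise from the key's successor and collect
        n distinct physical nodes."""
        start = bisect.bisect_left(self.tokens, (ring_hash(key), ""))
        result = []
        for step in range(len(self.tokens)):
            _, node = self.tokens[(start + step) % len(self.tokens)]
            if node not in result:
                result.append(node)
            if len(result) == n:
                break
        return result


ring = Ring()
for name in ("P1", "P2", "P3", "P4"):
    ring.add_node(name)
replicas = ring.preference_list("cart:alice", 3)
```

Here `replicas[0]` would act as the coordinator, and skipping tokens of an already chosen physical node is what keeps the replicas in physically different places.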


Let us now look at data versioning. It is possible that a put call, where I am trying to put a key and value on the ring, might return before the update has propagated to all the replicas. This is of course possible, because we have N replicas that are physically in different places. Let us say that V3 is the coordinator for a certain key. I send it the key and value, and these need to be sent to the rest of the replicas. Note that this N includes V3 also, so the update needs to be sent to the remaining N - 1 replicas. And this takes time, because you are talking of physical network messages that need to be sent by V3 to the remaining N - 1 replicas.


So, if there is a failure in this process, say a network failure or a node failure, then some replicas might get the update after a very long time. This is where eventual consistency plays a big role. We consider a put to be successful if at least one of the servers in the preference list records it. It is the job of this server, which is typically the coordinator, to broadcast the update to the rest of the servers on the preference list. There is no timing guarantee associated with this. Hence, we say that it is an eventually consistent system.


So, in some applications, such as add to shopping cart, a write, which means that I want to buy some item, always needs to complete and needs to complete quickly. We have discussed this before, and this is exactly why we prioritize writes. Consider a certain key k in a general key-value store; it can take a succession of values. If k is the shopping cart, then as we add or delete entries, V1 is the contents of the entire shopping cart at one point, then V2, and so on. Each one of them is considered a new and immutable, unchangeable version of data.


So, what we do is that we do not actually overwrite. We just create a new version and record the fact that V1 was the first version, then V2, and then V3. We will see that this simplifies our life significantly. Now, if there are failures and concurrent updates, then version branching may occur, which means that after V3 some servers might record a V4 and some might record a V5. Of course, this should not happen, but it will happen if there is a failure.


So, at the time of reading data or at the time of final checkout, there is a need to actually merge V4 and V5. This merging can be done in two ways. One way is generic logic on the server side, which is a general way of merging. For shopping carts, the reasoning is that if I added two items at different points in time and I have not removed them, then maybe I want to buy both, so I can just merge them.
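
A generic merge of two divergent cart versions could look like the following minimal sketch. The representation of a cart as a dictionary from item name to quantity, and the policy of keeping the larger quantity, are assumptions made for illustration; the lecture does not specify them.

```python
def merge_carts(cart_a: dict, cart_b: dict) -> dict:
    """Merge two divergent cart versions by keeping every item seen in
    either version, with the larger quantity when both contain the item."""
    merged = dict(cart_a)
    for item, quantity in cart_b.items():
        merged[item] = max(merged.get(item, 0), quantity)
    return merged


v4 = {"book": 1, "phone": 1}
v5 = {"book": 1, "charger": 2}
merged = merge_carts(v4, v5)
```

Note that a union-style merge like this never loses an added item, but it also cannot express a removal, which is exactly the kind of conflict that needs extra logic.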


However, this is very similar to what we had discussed in Coda. There, we had seen that some of these updates are not possible to merge. It is possible that one server records, let us say, add an item, and another one records remove the same item. What do we do? If they are in roughly the same window of time, we are never sure whether the customer actually wanted to add it, or add it and then remove it, or what the intention was.


And this can get further complicated. Let us assume that previously there was one add item and then we add one more. The relative timing of the add and the remove can make it complicated, because we would not know which one came first and which one came later.


So, in some cases it can matter, and in some cases it will not. But in these cases, it might be a good idea for the client-side browser to try to resolve the issue on its own, particularly if there is conflicting information that cannot be resolved automatically on the server side.


(Refer Slide Time: 32:56) 


System Architecture 


Vector Clocks for Versioning


@ A vector clock contains an entry for each server in the preference list.


@ When a server updates an object, it increments its vector clock.


@ If there are concurrent modifications, then a get operation returns all versions.


@ The put operation indicates the version.
@ The put is considered a merge operation.
@ Example


Smruti R. Sarangi Dynamo


So, the way that we manage versioning is that we use a vector clock. A vector clock contains an entry for each server in the preference list. Whenever a server updates the object, it increments its own entry in the vector clock. This is very similar to the CVV in Coda. If there are concurrent modifications, then a get operation will return all the versions, as it should, and after that some kind of version reconciliation is attempted.


So, this is how an example would look. Let us say there is a write operation that is handled by server A. The first vector clock would be A1, meaning that this is version 1 created by server A. If there is another write that is handled by server A, it will just increment the vector clock and make it A2. After this, it is possible that some failures or some branching may occur in the network. So, it is possible that a write is handled by B.


So, in this case, the vector clock would be (A2, B1), and we can have a simultaneous branch where the write is handled by C. In this case, it will be (A2, C1). Then, of course, there is a need to reconcile the writes. When we reconcile the writes, we reconcile the vector clocks as well. If A is doing it, it will increment A2 to A3, as we can see over here, and the rest of the entries just get copied, which gives (A3, B1, C1). This basically means that any subsequent version will have to emanate from here, and this version has essentially reconciled all the paths.
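
The example above can be replayed with a small vector clock sketch. The function names (`increment`, `descends`, `concurrent`, `reconcile`) and the use of a plain dictionary from server name to counter are assumptions made for this illustration.

```python
def increment(clock: dict, server: str) -> dict:
    """Return a copy of the clock with this server's entry advanced by one."""
    new_clock = dict(clock)
    new_clock[server] = new_clock.get(server, 0) + 1
    return new_clock


def descends(a: dict, b: dict) -> bool:
    """True if version a has seen everything that version b has seen."""
    return all(a.get(server, 0) >= count for server, count in b.items())


def concurrent(a: dict, b: dict) -> bool:
    """Two versions conflict when neither descends from the other."""
    return not descends(a, b) and not descends(b, a)


def reconcile(a: dict, b: dict, server: str) -> dict:
    """Take the entry-wise maximum, then count the reconciling write."""
    merged = {s: max(a.get(s, 0), b.get(s, 0)) for s in set(a) | set(b)}
    return increment(merged, server)


v1 = increment({}, "A")        # A1
v2 = increment(v1, "A")        # A2
v3 = increment(v2, "B")        # (A2, B1): write handled by B
v4 = increment(v2, "C")        # (A2, C1): concurrent write handled by C
v5 = reconcile(v3, v4, "A")    # (A3, B1, C1): A reconciles both branches
```

Since `concurrent(v3, v4)` holds, a get would have to return both versions; after the reconciliation, `v5` descends from both of them.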


So, using vector clocks for reconciling divergent data is a common, recurring pattern in distributed systems, and almost all such systems use it to varying degrees.


So, how exactly do we do a get and a put? Similar to Pastry, we send a request to any node, which will forward it to the coordinator. If we do not know the coordinator, we try to find it, or, given the key, we try to directly find the successor of the key. The node would then ideally access the entire preference list and send data to or receive data from the entire preference list. The preference list in itself is organized in a kind of special way.


So, we define a read quorum and a write quorum. A quorum is like an extension of the notion of a majority. Consider any voting system: if a majority of the nodes agree on doing something, then we say that it is the decision of the entire group. In this case, a read quorum of R nodes means that I read the data from R nodes, and that is sufficient to tell me what the data is. A write quorum of W nodes means that whenever I do a write, I need to send it to W nodes.


So, the classical condition that holds in all quorum systems is R + W > N. We will see why this is the case. Consider the fact that I am writing, and I write to W nodes. When I am reading, I should reach at least one of these nodes, and if I reach one of these nodes, I will get the update.


So, when will I not get the update? I will miss the update only when all the nodes that I read from are among the other N - W nodes. But if R > N - W, then it automatically means that my read quorum has at least one server that is also in the write quorum. Since I am reading from it, I am guaranteed to get the most up-to-date version of the data. So, given R > N - W, we can automatically infer R + W > N.
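
The overlap argument can also be checked by brute force for small N. The function below is a hypothetical helper written for this purpose; it simply enumerates every possible read set and write set and tests whether they always share a node.

```python
from itertools import combinations


def quorums_always_overlap(n: int, r: int, w: int) -> bool:
    """Check by enumeration that every read set of size r shares at least
    one node with every write set of size w, out of n replicas."""
    nodes = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(nodes, r)
        for write_set in combinations(nodes, w)
    )


# With N = 3: R = 2, W = 2 always overlaps, while R = 1, W = 2 may not.
overlap_ok = quorums_always_overlap(3, 2, 2)
overlap_bad = quorums_always_overlap(3, 1, 2)
```

For every small N that one tries, the enumeration agrees with the condition R + W > N, which is just the pigeonhole argument made in the text.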


So, Dynamo does use a quorum system, in the sense that the N servers of the preference list are not all used in the same way. We can designate some as read nodes and some as write nodes, as long as this condition holds. For a put request, the coordinator merges the versions, if it can, and broadcasts the result to the write quorum. For a get request, the coordinator sends all the concurrent versions to the client, and then the client needs to do a merge. It can be the client, or it can be some other Amazon node acting on behalf of the client.


So, there are many ways of looking at and thinking about this. As I said, many different kinds of complicated logic can be implemented on a Dynamo system, and we do not want to constrain it in any way. Of course, if the server or somebody else can automatically merge the conflicts, that is best. Otherwise, the client needs to do it.


So, let me come back to the example that I had in the previous slide. It is kind of a tricky example, so a little bit of re-explanation might be necessary. One of the examples that I had mentioned is this add and remove item case, and I would like to use some of the free space over here for a quick clarification. Let us construct an example that need not be all that practical, but that will give you an idea of the kind of things that we need to deal with.


So, consider the shopping cart, and let us say that I add a phone with a 10% discount. Then I add another phone of the same model with a 30% discount. Then I just say remove phone, and I do not specify which one it is.


Of course, there are many ways of ensuring that this does not happen. For instance, we can give both the phones different internal IDs, and the remove would refer to one specific internal ID. But let us assume that we have an implementation where we are just saying that I have phones, and then I just want to remove one of the phones with this model number.


And this can happen not from the client side, but maybe from an internal Amazon site; maybe one of the phones suddenly went out of inventory, so it would get removed. So, now the question is, which one? Let us say that this was version A, and then a version branching took place, so this became B and this became C. We would not exactly know the relative ordering between B and C.


If C comes first, one phone will be removed, and if C comes later, the other one will be removed. So, in many cases, for concurrent events, we are not sure of the ordering, and that is why an automatic merge followed by a manual merge is required. This merging can lead us to very tricky, complicated scenarios. We have already seen how to handle them in Coda. In Dynamo, the philosophy is slightly different.


So, the philosophy of Dynamo is that we send all the concurrent versions to the client. The client can either be the actual user or some internal Amazon node that sits between Dynamo and the browser. The browser or the internal node can then use its own logic to resolve these conflicts in some way. So, that is not really the lookout of the DHT.


So, furthermore, Dynamo defines a new concept called a sloppy quorum. The sloppy quorum uses the first N healthy nodes, which is typically the preference list. Assume that the coordinator cannot deliver an update to node A. So, let us say we have this part of the ring, and this is the coordinator. It wants to send an update to node A, but it cannot deliver it.


So, in this case, it actually delivers the update to another node D, and it drops a hint to D that once A recovers, D should transfer the update to A. This is known as a sloppy quorum because it is not an exact quorum: the coordinator will still record the fact that A got the update even though A actually did not. We can think of this as some form of a relaxed quorum. A need not get the update immediately, but at a later point in time some other node that has better access to A can transfer the object to A, and that node gives the coordinator a promise that it will do so. Of course, for added reliability, this quorum should span data centers, so that our system is as reliable as possible.
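
The hinting mechanism can be sketched as follows. This is a simplified, hypothetical model: `sloppy_write`, its arguments, and the returned dictionary are invented for illustration, and a real system would send network messages instead of filling a dictionary.

```python
def sloppy_write(key, value, ring_order, healthy, n):
    """Store (key, value) on the first n healthy nodes in ring order.

    ring_order: nodes in clockwise order, starting at the coordinator.
    healthy:    set of nodes that are currently reachable.
    Returns {node: (key, value, hint)}, where hint names the intended owner
    when the node is only a stand-in, and is None otherwise.
    """
    intended = ring_order[:n]
    missed = [node for node in intended if node not in healthy]
    stores = {}
    for node in ring_order:
        if len(stores) == n:
            break
        if node not in healthy:
            continue
        if node in intended:
            stores[node] = (key, value, None)
        else:
            # Stand-in node: remember who should eventually get the data.
            stores[node] = (key, value, missed.pop(0))
    return stores


ring_order = ["V3", "A", "B", "D", "E"]
result = sloppy_write("cart", "v1", ring_order,
                      healthy={"V3", "B", "D", "E"}, n=3)
```

In this run A is unreachable, so D stores the update together with the hint that it belongs to A; once A recovers, D would hand the update over and drop its copy.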


Now, let us discuss more about how we do synchronization across replicas. Recall Merkle trees, which we had discussed in our lecture on Bitcoin and blockchain. Here is a quick recap. This is a regular tree. The leaves are the ones that actually contain real data, and each internal node, each parent, just contains the hashes of its children. Finally, the root of the tree contains the hashes of its immediate children.


So, if you think about it, if I change the value in any leaf, all of the hashes on the path up to the root will change. The root in a sense contains a gist of all the data in the entire tree. If I just re-compute the Merkle tree and compare the contents of the roots of two Merkle trees, that is equivalent to comparing the contents of the entire data that each tree represents.


So, in this case, the Merkle tree covers the set of all the keys that are mapped to a virtual node. It essentially represents the range of keys that a virtual node stores. Nodes regularly exchange Merkle trees via an anti-entropy based mechanism. Why is that done? Well, let us say I have the coordinator over here, and I have a set of replicas that also contain the data for this zone. It is possible that over time, because of lost messages in the network, the data can go out of sync.


So, if the data goes out of sync, then there is a problem. That is the reason that, for this zone, the data is reconciled via an anti-entropy kind of mechanism. Nodes regularly exchange Merkle trees to ensure that the replicas are still consistent with the coordinator. So, trees often have to be recalculated. And if there is a discrepancy, the data has to be merged.
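
A minimal sketch of the root comparison is given below. The choice of SHA-256, the `key=value` leaf encoding, and the rule of duplicating the last hash on odd levels are assumptions made for this illustration and are not taken from the lecture.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(items: dict) -> bytes:
    """Root hash over the (key, value) pairs of one key range."""
    level = [_h(f"{k}={v}".encode()) for k, v in sorted(items.items())]
    if not level:
        return _h(b"")
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])  # duplicate the last hash on odd levels
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


coordinator = {"k1": "a", "k2": "b", "k3": "c"}
replica_in_sync = {"k3": "c", "k1": "a", "k2": "b"}
replica_stale = {"k1": "a", "k2": "OLD", "k3": "c"}

in_sync = merkle_root(coordinator) == merkle_root(replica_in_sync)
out_of_sync = merkle_root(coordinator) != merkle_root(replica_stale)
```

Only the roots are compared here. A fuller version would keep the internal hashes and descend only into the subtrees whose hashes differ, so that just the divergent keys are transferred.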


So, how does Dynamo maintain membership? Recall that we had discussed virtual synchrony in a previous lecture, where we talked about view changes. We do have something here that is conceptually more or less similar. Dynamo maintains membership information through explicit join and leave requests. Compared to other systems, where membership changes are far more common, in Dynamo this is a particularly slow operation. So, ring membership changes are infrequent.


Additionally, a gossip based protocol, of which we have seen different forms, exchanges ring membership information across randomly chosen nodes. Now, why is this important? Well, the reason is that in Dynamo the routing tables are massive, because we aim for pretty much 1-hop routing.


So, what I mean by 1-hop routing is that, unlike Chord and Pastry, where routing takes O(log n) steps, here in Dynamo it is a single hop. What that means is that I have very large routing tables and I try to reach the node immediately in one step. So, all routing, membership, and placement information propagates, again, via an anti-entropy based gossip protocol among all the nodes, and they try to keep the membership information roughly consistent.


Of course, it is possible that logical partitions in the ring get created because of lost messages, which basically means that nodes in one partition have no idea about the nodes in another partition, because this information got lost. So, we have some seed nodes that act as global directories and have exact information about each of the nodes in the network. You can think of the seed nodes as global aggregators of information about when a node joins and when a node leaves. These seeds can be consulted if nodes feel that there is a partition, and that can be used to reconstruct the parts of the ring.


So, failure detection is also done via gossip-style protocols, where we have messages, acknowledgments, and heartbeats. If a node is unresponsive for a long time, the other nodes can declare it to have failed and exchange this information via gossip. Of course, the assumption here is that epidemic and gossip algorithms are known to the viewer. Viewers who do not know about them can look up the course webpage, where they will find the slides, and a video on this is also coming up.


So, node addition and removal happen in the same manner as in Chord, so I am not going over it. It is very similar: a node fixes its own routing table and also adds itself to the routing tables of other nodes, and the same is true for removal as well. Here, since keys are replicated at the successors, when a new node is added, some of the data is moved from the successors to the new node. So, some data movement does indeed happen. That is why node addition and deletion are expensive, particularly in this setup with 1-hop routing.


So, there are two things that actually make Dynamo kind of heavy and expensive. One is 1-hop routing. The main reason for it is that when we are browsing Amazon, searching for new products and adding them to our cart, we want a very quick response from the website. That is the reason 1-hop routing is required in Dynamo. The other is, of course, that we maintain N replicas. Both of these things make things slow and make the protocol heavy, and that is why nodes need more storage.


And once nodes need more storage, any addition or deletion means that you have to move a lot of data. So, keeping the data aspect in mind, when Dynamo was evaluated, it was implemented with several kinds of backing stores: an in-memory buffer, Berkeley DB, which is meant for storing very short files, and a traditional MySQL database. The entire system was written in Java with Java NIO channels, which are a very fast implementation of I/O channels in Java.


Let us now look at the response time for reads and writes. In the peak season of December 2006, the average read time had a diurnal variation. The period of the variation was roughly 12 hours. What was seen in this 12 hour period is that the average read time varied from 12 to 18 milliseconds. An astute reader can ask: our system was designed to make writes fast, so how come the average write time in the same period is 21 to 30 milliseconds, while the average read time is just 12 to 18 milliseconds?


Well, the answer lies in the fact that writes need to be persisted to durable storage, such as the disk, in at least one of the replicas in the preference list. Persisted means written to stable, durable storage, and this takes time. In comparison, a read is much faster, because most of the data for a read can come from the cache. But the write needs some durability guarantees, because at least the A, I and D properties hold in our system. So, that is why writes were still slower, but not that much slower, because we do so much to speed up writes.


So, these were the mean values, the 50th percentile values, but the 99.9th percentile values were roughly 10 times more. So, this much inherent variation was there in the system. Nevertheless, 100 milliseconds or 200 milliseconds is not much, and the user is willing to wait for seconds to get the answer. So, this is why the system works, and it is reasonably fast and responsive.


So, another thing that was compared was buffered writes versus direct Berkeley DB writes. Buffered writes are writes where we write to memory. Recall that the average numbers are 21 to 30 milliseconds, and the 99.9th percentile values were 10 times more, which means roughly in the range of 200 to 300 milliseconds.


But if we consider in-memory writes, where we just write to memory and gradually persist to disk, the 99.9th percentile values were far lower, between 40 and 60 milliseconds. This is a reasonable idea if we have some reliability guarantees, where we know that memory is not going to fail.


And for direct Berkeley DB writes, where Berkeley DB is a database tuned for small files, the values were between 40 and 180 milliseconds. So, the write time was nevertheless still high if you were not using a full in-memory solution. But, of course, the technology has changed since then. In those days flash memory was not there. With flash memory, which is non-volatile storage, these numbers should have become much better.


So, as we had discussed, there are two kinds of reconciliation methods. One is business logic based reconciliation. In a case like the shopping cart, we know exactly what is to be done: if we have multiple versions of the shopping cart, we merge them. For many other services, we can just have timestamp based reconciliation, which, as we had seen in the Coda file system, was there with AFS.


So, in this case, the last write wins, which means that for the same object, whoever is the last writer is the one who finally decides the state. Of course, in this case there is no merging. Something like customer session management, where we are managing the session of a customer, falls into this category.
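
Timestamp based reconciliation is simple enough to show in a couple of lines. In this sketch a version is assumed to be a `(timestamp, value)` pair; both the representation and the function name are hypothetical.

```python
def last_write_wins(versions):
    """versions: iterable of (timestamp, value); return the newest value."""
    return max(versions, key=lambda version: version[0])[1]


session = last_write_wins([(10, "page=home"), (12, "page=cart"), (11, "page=item")])
```

Unlike the shopping cart merge, the older versions are simply discarded, which is acceptable only when losing an intermediate update does no harm.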


So, the last point that we need to discuss is that in a circular ring, there are multiple virtual positions for a physical node. These are called tokens. How do we place the tokens? The first approach is that we randomly place the tokens on the ring. This makes a node responsible for random portions of the key space, which means that if a node is over here, then all of the key space between this node and the previous node is mapped to this node.


So, any node addition, of course, is expensive in this case, because the key-value data has to be migrated and we need to re-compute the Merkle trees. This is the baseline vanilla scheme that we have been discussing: we just place tokens at random positions for the same physical node. Then the regular Chord-like mapping scheme is used, where every key is mapped to its successor.


The other approach is something like this. We divide the hash space into Q equally sized partitions. Each partition is placed on the first N nodes, which means that we split the job of partitioning the key space from that of assigning the keys to successors. So, if I consider this partition of the key space, and these are the positions of all the nodes, then a given partition is assigned to the N successive nodes on the ring.


So, basically, if we have this partition over here, any key falling in this region is kept here, also here, and also here; this is its preference list. So, we separate the tasks of partitioning and placement. If I were to draw a zoomed-in view of the same figure, and let us say this is how my key space is partitioned and this is where my virtual nodes are, then this region of the key space is assigned to the N successive nodes on the ring.


So, this is a better form of load balancing. We have discretized, or quantized, the key space, and we have also ensured a certain amount of load balancing for each virtual node, which means that it is not the case that a single virtual node is responsible for a very large portion of the space. Of course, there can be sparse regions, but this is nevertheless still a reasonably effective way of trying to balance the load.


But, of course, the main problem with Strategy 2, as compared to, let us say, Strategy 1, is that the distance between two nodes can be quite large; there is some sparsity in the way that the nodes are mapped. In that case, all of the key space partitions between them will get mapped to this node, and that is a problem. This will still cause some load balancing issues. So, that is the reason we have Strategy 3.


Token Distribution


So, in this case, since the hash space and the key space are the same, we divide it into Q equally sized partitions. This is something that we were doing in Strategy 2 as well. Here, S is the total number of nodes. But in this case, our load balancing is explicit. This is a slow scheme, but it gives the best runtime performance. If there are Q tokens and S nodes, the average number of tokens per node is Q/S.


So, what we essentially do is that whenever a physical node joins, we assign it random tokens. Let us say this is one position, then another position, another position, and another position. Essentially, we steal tokens from other nodes such that, on average, each node holds roughly Q/S tokens. So, this is a very explicit form of load balancing, where all the tokens are of the same size, and the number of tokens held by each node is roughly equal.
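
The token stealing step can be sketched as follows. The `join` function, the dictionary from partition index to owner, and the rule of always stealing from the most loaded node are assumptions made for this illustration; the lecture only states that tokens are stolen until each node holds about Q/S of them. The sketch also assumes that at least one node already exists.

```python
import random


def join(assignment: dict, new_node: str, rng: random.Random) -> dict:
    """assignment maps partition index -> owning node (Q entries).
    The new node steals randomly chosen tokens from the most loaded nodes
    until it holds about Q/S of them, where S counts the nodes after the join."""
    assignment = dict(assignment)
    nodes = set(assignment.values()) | {new_node}
    target = len(assignment) // len(nodes)  # floor(Q / S)
    owned = {n: [p for p, o in assignment.items() if o == n] for n in nodes}
    donors = sorted(n for n in nodes if n != new_node)
    while len(owned[new_node]) < target:
        donor = max(donors, key=lambda n: len(owned[n]))
        token = owned[donor].pop(rng.randrange(len(owned[donor])))
        owned[new_node].append(token)
        assignment[token] = new_node
    return assignment


# Q = 12 partitions spread evenly over three nodes; then node D joins.
before = {p: "ABC"[p % 3] for p in range(12)}
after = join(before, "D", random.Random(0))
counts = {n: list(after.values()).count(n) for n in "ABCD"}
```

After the join every node holds 12 / 4 = 3 tokens. The data for each stolen token has to be moved to the new node, which is where the cost of this strategy comes from.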


So, Strategy 3 is clearly expensive, but it appears to be the best, as we will see when we look at the results. Strategy 2 is an intermediate step that takes us to Strategy 3, and Strategy 1 is, of course, the random default scheme. If I define the efficiency as the mean load divided by the maximum load, where the load is the number of keys stored per virtual node, then Strategy 3 would be the most efficient, with an efficiency of 99%.


The next would be Strategy 1, which is purely random, at 95%. And the last is Strategy 2 with 83%, because here we did discretize the key space. Strategy 2 was the one that introduced discretization with the aim of doing load balancing, but it did not really take care of sparsity very well, which the random scheme actually did. So, Strategy 3, which does explicit load balancing, is the best.


So, of course, Strategy 3 is also rather expensive when it comes to node addition and deletion. When we add a node, we have to steal tokens from other nodes and give them to the new one. While deleting, we have to follow the reverse process of taking out the node's tokens and distributing them to other nodes. This is, of course, expensive. Strategy 1 is not that expensive, and neither is Strategy 2. So, there is clearly a trade-off between the amount of effort we put in and the quality of the results.


So, this was the original paper, "Dynamo: Amazon's Highly Available Key-value Store", in Operating Systems Review, 2007. I would request the viewers to take a look at recent avatars of Dynamo and read some of the recent papers from Amazon to get an idea of the improvements that have been made to the system.


So, one thing that we will appreciate in this course is that subsequent key-value stores that were used by Facebook, LinkedIn, and so on have a design that follows a very similar philosophy, where the idea is essentially to create a large key-value store for storing a lot of data.


So, if I were to summarize, one of the main points was 1-hop routing, which meant we had to keep massive routing tables, and this also made node addition and deletion expensive. And




the other was that every physical node was assigned many virtual positions, which, as we saw, were random, but a kind of intelligent random, where load balancing was explicitly taken care of. The other thing is that we did prioritize writes. One way of doing this was by using vector clocks to maintain different versions and merging them, either automatically or using business logic based merging, where the client does the merge.
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In this lecture, we will discuss the Google Percolator system. The percolator system is used to 


manage a large part of Google's search infrastructure.
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So, we will first discuss a little bit about Google's search algorithm, the requirements of such an 


algorithm, and then the design of the system, and finally, the evaluation. 
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So, the Google’s web index is very, very large. So, in a certain sense, it indexes the entire web and 
the web is very large. In fact, we do not know how large it is, but it has at least millions of
websites, and the number might be in the billions as well. So, tens of petabytes of data are
generated every single day. Google has to go through all of this data. It has to figure out which
page points to which other page, and it also has to index all the words that can appear in possible
search queries.
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There are billions of updates per day. So, also, Google has to keep track of the updates. And there 
are thousands of machines. And we will see that the main problem that a Percolator-based system
has to deal with is cascading updates. So, in the next few slides we will see what these are.
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So, the core of Google’s search algorithm is the page rank algorithm. So, every page has a page 
rank. So, the page rank is a measure of how popular the page is. So, let us say a given page is very, 
very popular, it has a high page rank, and if a page has low popularity, then it means that it has a 


low page rank. 


So, the page rank of a page is determined by the page ranks of all the pages that link to it. So,
if a lot of popular pages point to one page, then it automatically means that this page's page rank
should also become high. The reason is that if extremely popular pages are pointing to a page, this
page also becomes popular, kind of by default. So, the page rank of this page increases.


Consider an example. Assume that a very popular newspaper like the New York Times points to some
link; then that link is automatically presumed to be important or popular. So, the page rank for
this link will also be high. Now, this video is being recorded from IIT Delhi, and I am Professor
Sarangi here. If my official IIT Delhi website points to some other website, then, since I am not a
very important person, my page rank is low. So, I will contribute very little to the page rank of
the other page.
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And of course, if the only pages that point to that other page belong to people like me, then
that other page will also have a very, very low page rank.


So, essentially, page rank is like a self-reinforcing metric: if popular pages point to a page, then
the page rank of that page is high, and so on and so forth. And less popular pages, even if they
point to other pages, do not contribute much when it comes to page rank.
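The idea that a page inherits popularity from the pages pointing to it can be sketched in a few lines of Python. This is only a minimal illustration of the textbook iterative formulation, not Google's production algorithm; the damping factor of 0.85, the function name `page_rank` and the toy web are all assumptions made for the example.

```python
def page_rank(links, damping=0.85, iterations=50):
    """Iteratively compute a rank for every page in `links`.

    `links` maps each page to the list of pages it points to.
    Every link target must itself be a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small base share regardless of incoming links.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if not outgoing:
                # A page with no outgoing links spreads its rank evenly.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
            else:
                # Otherwise its rank is split among the pages it points to.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical toy web: a newspaper that many blogs point to, and a
# personal homepage that nobody points to.
toy_web = {
    "newspaper": ["story"],
    "blog_a": ["newspaper"],
    "blog_b": ["newspaper"],
    "blog_c": ["newspaper"],
    "homepage": ["tutorial"],
    "story": [],
    "tutorial": [],
}
ranks = page_rank(toy_web)
# A link from the popular newspaper is worth more than one from the homepage.
print(ranks["story"] > ranks["tutorial"])
```

In this toy web, the page linked from the newspaper ends up with a higher rank than the page linked from the homepage, even though each of them has exactly one incoming link.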
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So, if I would, so let us say that I take a Google search page, and I just type in my name. So, what 
are the things that Google shows? Well, Google shows that there are 21,400 results. So, again, that
does not make me very popular. Actually, the sarangi is an Indian musical instrument, and it
probably accounts for the lion's share of the results, more than 99%. The number of results with my
name is very small. Nevertheless, Google took 0.37 seconds.


And so let us take a look at the results that Google actually got. In terms of relevance,
this is probably the most relevant one, which is my official IIT Delhi homepage. With every link,
Google attaches a piece of text. This piece of text over here is known as the anchor text. The
anchor text helps describe what the link is. Additionally, Google went through my page and all of
the links on it, and it is essentially listing the six most popular links along with the anchor text.
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After that, after my homepage, Google was able to find my LinkedIn profile. So, it turns out there 
are many people with my name. I think the first one is me, so Google has found this correctly.
Here also there is a little bit of anchor text. So, these links over here, the search results, are
arranged in descending order of page rank, with the most popular pages coming first and the less
popular ones, again as determined by the page rank, coming later.


So, this is actually a very good idea. It is a very revolutionary idea. The main aim here is that
on the first page itself, the first few links are the most relevant ones. And as I keep scrolling
down, the relevance of the links decreases.
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So, now how does Google exactly do this? Well, so Google creates a web index. So, this is also 
called an inverted index. In an inverted index, we start with a word; let us consider my last name.
For the last name, Google maintains a linked list, which is ordered in descending order of page
rank. So, we have the links over here. We have multiple links, and along with each link, we have
the anchor text.


So, along with each link over here, we have the anchor text, as we see. If I were to consider my
first name, again, we would have a set of links. Now, suppose the query is my full name. Actually,
my full name has a middle initial, which I do not use, so Google will probably not make any sense
of my middle initial. But from my first name, it would find all of these links. From my last
name, it will find all of these links. It will take an intersection of these sets. Again, the
intersection will be ordered in descending order of page rank. And this is how it will generate
the search results, which appear over here.
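The lookup described above, one list of links per word followed by an intersection ordered by page rank, can be sketched as follows. This is a toy illustration under stated assumptions: the page texts, URLs and rank values are hypothetical, sets stand in for the linked lists, and the helper names `build_inverted_index` and `search` are invented for the example.

```python
from collections import defaultdict


def build_inverted_index(pages):
    """Map every word to the set of URLs whose text contains it."""
    index = defaultdict(set)
    for url, (text, _rank) in pages.items():
        for word in text.lower().split():
            index[word].add(url)
    return index


def search(query, index, pages):
    """Intersect the URL sets of all query words, then sort by page rank."""
    words = query.lower().split()
    if not words:
        return []
    # An unknown word yields an empty set, so the intersection is empty too.
    hits = set.intersection(*(index.get(w, set()) for w in words))
    # Highest page rank first.
    return sorted(hits, key=lambda url: pages[url][1], reverse=True)


# Hypothetical pages: url -> (text, page rank).
pages = {
    "univ.example/jane": ("jane doe homepage", 0.4),
    "zoo.example/doe": ("doe a female deer", 0.9),
    "social.example/jane": ("jane doe profile", 0.2),
}
index = build_inverted_index(pages)
results = search("jane doe", index, pages)
print(results)
```

The page about the deer has the highest rank, but it does not contain the first name, so it drops out of the intersection, much like the musical instrument in the example from the lecture.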


So, this is only a small part of the story. So, this is the story after the inverted index has been 
created. So, after that searching is actually easy. So, what Google actually does is that Google 
stores parts of this index in a large number of servers, of course, with redundancy. So, we have 
been studying many ways of doing that. So, I will not go over it. And whenever I do a search, it 
quickly locates where my first name would be, it goes to that server, and fetches a list of links. It 


does the same for my last name, computes an intersection and shows me the results. 
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So, the main problem that percolator actually tries to solve, which is by far the most 
computationally heavy part within Google's system, is that if the page rank of a website changes or
if the website itself changes in terms of content, then it is necessary to update Google's internal
data structures. This process is the most computationally intensive one. What Google used to do
prior to Percolator was to crawl the entire web.


Crawling the entire web basically means going to the most popular websites, looking at their
links, visiting those links, and so on and so forth, such that the entire web is covered. For every
single website on the web, Google goes through it, finds all the other websites it points to,
captures all the anchor text and creates the inverted list. This needs to be done periodically,
and it is a pretty heavy operation. This entire process is known as crawling.


So, what can happen is that if the page rank of a certain website changes, we need to update the
inverted list to reflect this change. The page ranks of the sites that this site points to also need
to change. So, this is a cascading update. What is it? Let us say this page suddenly becomes more
popular, so its page rank increases, and this page points to two other pages, so their page ranks
increase as well. Then each of those pages again points to, let us say, three other pages. For each
of them the page rank increases, and so on and so forth. This problem is known as a cascading update.
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And this is a major problem, because we have millions of web pages changing their page ranks
every day. So, Google has to do a lot of work to ensure that its database is in sync with the web.
Prior to Percolator, what Google used to do is that periodically, let us say every week or every
two weeks, it used to crawl the entire web and create the inverted list. But if I want to make the
system more responsive and keep an up-to-date database within a span of, let us say, one or two or
three days, and maybe even less than that for news, I cannot crawl the entire web.


What I have to do is increase the page rank of one page, go to all the pages it points to,
increase their page ranks, and so on and so forth. Of course, let us say the page rank of the first
page increases by 5 percent, in arbitrary units; the increase in the page rank of the other pages
will be far lower. It might be 2 percent. So, gradually, the effect will die down.


But essentially it is important to percolate this change. This is where the name Percolator comes
from: whenever there is a change, it needs to be percolated to at least some depth. This used to be
expensive. The cost was reduced by the Percolator project, which essentially percolates such
updates.
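The way a change spreads and dies down can be sketched as a breadth-first walk over the link graph. This is a toy model, not the actual Percolator computation: the attenuation factor of 0.4 (so that a change of 5 becomes 2 one hop away, as in the example above), the cut-off threshold and the function name `percolate` are assumptions made for illustration.

```python
from collections import deque


def percolate(links, rank, start, delta, attenuation=0.4, threshold=0.01):
    """Apply `delta` to `start` and push a shrinking share to linked pages.

    Returns the pages that were updated, in the order they were visited.
    With attenuation below 1 the change shrinks at every hop, so the walk
    terminates even if the link graph contains cycles.
    """
    updated = []
    queue = deque([(start, delta)])
    while queue:
        page, change = queue.popleft()
        rank[page] += change
        updated.append(page)
        passed_on = change * attenuation
        if abs(passed_on) < threshold:
            continue  # the effect has died down; stop percolating here
        for target in links.get(page, []):
            queue.append((target, passed_on))
    return updated


# Hypothetical link graph and ranks in arbitrary units.
links = {"p": ["q", "r"], "q": ["s"], "r": [], "s": []}
rank = {"p": 10.0, "q": 10.0, "r": 10.0, "s": 10.0}
touched = percolate(links, rank, "p", delta=5.0, attenuation=0.4, threshold=1.0)
print(touched, rank)
```

Here the change reaches the two pages that `p` points to, but it is too small to be worth passing on to `s`, so only part of the graph is revisited instead of the whole web.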
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So, let us look at the requirements. So, the requirement that Google had is that it wanted to have a 
strict ACID transaction semantics. Of course, knowing what ACID is, is a prerequisite for this
lecture, so we will not go over it. We have also covered it in the previous lecture, and any
database book or any book on distributed systems would describe the ACID semantics. I would request
the viewers to go over that, or maybe read the Wikipedia article on ACID semantics, before
continuing.


So, the reason for the ACID semantics over here is that Google did not want its database to be in
an inconsistent state, because poor search results were something that Google was not comfortable
with. The system should also have an acceptable latency, which means that neither a search query
nor an update should take a very long time. And of course the throughput should be high, because
we are talking of a large number of changes, and the system should be able to handle petabytes of
data. For this, you clearly cannot use traditional database systems, because they are too slow and
inefficient.


So, we need random access to data such that the changes can percolate. This is required because it
will not be a sequential scan of the repository. It will be one page pointing to a few other pages,
which are again pointing to a few other pages. So, random access to data is required. Furthermore,
there is a set of consistency models to choose from: serializability, strict serializability,
conflict serializability and view serializability.
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So, I would suggest that the readers before proceeding forward get an idea of what is serializability 
which essentially means that all the transactions can be organized in a serial schedule that gives
the same result. So, this is serializability. But of course Google did not go for that. We will see
in a second why. They went for another consistency model which is quite popular for large datasets.
It is called snapshot isolation.
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So, snapshot isolation can be explained very well in the context of linked lists. So, consider 
a traditional linked list with the elements a, b, c. Now, consider two transactions T1 and T2.
So, assume T1 wants to modify the contents of element a, and T2 wants to modify the contents of
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element c. So, what would happen in this case? Assume that T2 goes first.


So, if T2 goes first, then T2 would essentially traverse the linked list. It will first traverse
cell a. It might read its contents as well; assume it does. Then from a it will go to b, and from b
it will go to c. And on c it will do the write. What T1 would do, after T2 has gone, is start from
the first cell, which is a, and on a it would do the write. So, if you think about it, there is a
read-write conflict over here, because we are assuming that while traversing the linked list, we
are reading each and every element.


And even if we are not reading the elements, we are at least reading the next pointer. And if a
were to be deleted, the next pointer would change. But let us assume, for the sake of simplicity,
that the contents of each of these linked list elements are being read. So, one thing is clear:
T2 has done a read access on a and T1 is doing a write access on a. So, between T1 and T2, there is
a conflict.


So, T1 and T2 are conflicting, which means that in a traditional transaction system, whenever
there is a read-write conflict on the same data, one of them clearly needs to abort. But if we
actually think about it, in this case neither of the transactions has to abort. Why is this the
case?


Well, the reason is that even though T2 is scanning through the linked list, it is not really
interested in the contents of a, b or anything else. It is only interested in the contents of c; it
just has to cross a and b to get to c. Once it gets to c it performs the write and writes a new
value. And T1 is interested in a, and it is changing the contents of a.


So, we can allow T1 and T2 to proceed in parallel. They would technically not cause an issue.
The reason is that T2 just wants to cross a and go to the other side. It is not really interested
in what exactly you do with a, and that is exactly why, whether T1 comes before T2 or after T2, the
results are the same. So, why do we typically have a conflict in a distributed system?


Well, we have a conflict because the order of the transactions matters: if T1 comes first and T2
comes later, or if T2 comes first and T1 comes later, then the final outcomes are different. But in


this case while updating a linked list as long as the elements are physically different, which in this 
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case they are, a and c are different, what is happening is that essentially the order of transactions 


T1 and T2 does not matter. Hence, they can run in parallel. 


So, this behavior will typically not be allowed by many database consistency models. But snapshot
isolation will allow it. If you think about it, what T2 is essentially doing is that before it
starts, it is kind of taking a photograph of the system. It is going to c and making a change over
there.


What T1 is doing is it is taking a photograph of the system, going to a, and writing its updates
over there. Now consider any transaction that comes after T1 and T2; let us call it T3. T1 and T2
might or might not see each other’s updates; we do not know, and it does not matter. But T3, if it
comes after both of them, will see the updates of both T1 and T2. So, this is called snapshot
isolation.


So, snapshot isolation would essentially allow this behavior. It is also called snapshot isolation
semantics. What this means is that before starting execution, you take a snapshot, which means you
create a full copy. It is a logical copy, not a physical copy, but logically it appears as if you
have created a copy. Then you operate on the snapshot and you create a new version of it.


Now, let us say there are multiple concurrent transactions. Let us say there was another
transaction T1’, and it was changing some other element over here, d. As far as T1’ is concerned,
it does not care about the updates of T1 and T2, because they are happening to different elements.
So, whenever it starts, the semantics are as if it takes a photograph of the entire system, then
traverses to d and makes its changes over there.


And of course, if T3 comes after T1’, it will see the updates of T1’ as well. So, this is known
as snapshot isolation semantics: just before a transaction starts, it appears as if we are taking
a photograph of the entire system, and then we perform the reads and writes on our snapshot. As
long as parallel transactions do not actually modify the same exact piece of data, we are fine.
Then, once all the parallel transactions end, every subsequent transaction sees all the updates.
So, this is snapshot isolation. And this is exactly what Google went for.
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Of course, this is weaker than many of the known consistency models, like serializability,
for example. However, this is pretty much what we can afford in a large distributed system.
Recall the CAP theorem: there is a trade-off between consistency, availability and tolerance
to partitions. So, if we want higher availability, we need to reduce our consistency requirements,
which in a certain sense has been done here.


So, the example that I gave about linked lists is summarized over here: if the same node or its
parent is not being accessed, we can have parallel accesses. When a transaction starts, it takes
a consistent snapshot of the entire database. Then the updates appear to be applied to a private
copy of the database, and the values are committed only if they have not been changed by another
transaction since the snapshot. So, snapshot isolation still does not admit concurrent updates to
the same item.
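The rule summarized above (read from a snapshot, buffer writes privately, commit only if nobody else has committed the same item since the snapshot) can be sketched with a small multi-version store. This is a single-threaded toy model and an assumption about one reasonable way to realize the rule, not Percolator's implementation; the class names `SnapshotStore` and `Transaction` and the logical clock are invented for the example.

```python
class SnapshotStore:
    """Toy multi-version store with snapshot isolation semantics."""

    def __init__(self):
        self.versions = {}  # key -> list of (commit_ts, value), oldest first
        self.clock = 0      # logical timestamp source

    def begin(self):
        self.clock += 1
        return Transaction(self, start_ts=self.clock)


class Transaction:
    def __init__(self, store, start_ts):
        self.store = store
        self.start_ts = start_ts
        self.writes = {}    # private buffer, invisible to others until commit

    def read(self, key):
        if key in self.writes:
            return self.writes[key]
        # Return the newest version committed at or before the snapshot.
        for commit_ts, value in reversed(self.store.versions.get(key, [])):
            if commit_ts <= self.start_ts:
                return value
        return None

    def write(self, key, value):
        self.writes[key] = value

    def commit(self):
        # Abort if any key we wrote was committed by another transaction
        # after our snapshot was taken.
        for key in self.writes:
            history = self.store.versions.get(key, [])
            if history and history[-1][0] > self.start_ts:
                return False
        self.store.clock += 1
        commit_ts = self.store.clock
        for key, value in self.writes.items():
            self.store.versions.setdefault(key, []).append((commit_ts, value))
        return True


store = SnapshotStore()
setup = store.begin()
setup.write("a", 1)
setup.write("c", 3)
setup.commit()

# T1 and T2 overlap in time but touch different elements, so both commit.
t1, t2 = store.begin(), store.begin()
t2.read("a")              # T2 only passes over a on its way to c
t1.write("a", 10)
t2.write("c", 30)
ok1, ok2 = t1.commit(), t2.commit()

# T3 starts after both and sees both updates.
t3 = store.begin()
seen = (t3.read("a"), t3.read("c"))

# Two overlapping writers of the same element: the second one must abort.
t4, t5 = store.begin(), store.begin()
t4.write("a", 11)
t5.write("a", 12)
ok4, ok5 = t4.commit(), t5.commit()
print(ok1, ok2, seen, ok4, ok5)
```

Note that the read of `a` by T2 does not block T1's write to `a`; only overlapping writes to the same element cause an abort.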
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Now, the structure, so the structure of percolator is like this that percolator is built on other Google 
technologies. The technologies that it is actually built on are Bigtable and the Google File
System. So, let us discuss Google Bigtable. Google Bigtable is essentially a large
multi-dimensional database. You can think of it as a large 2D or 3D or 4D database. But as far as
we are concerned, we will only look at a 2D database.


So, consider a Bigtable to be a huge table, which has maybe a million rows, and, since it is a
row-column format, each row over here has maybe 1,000 entries. So, this is huge.


So, Bigtable can be treated as a distributed key value store, where a primary key of each row is 
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essentially the key for that row, a unique identifier, and the value would be the entire contents of 


the row. 


So, of course, here we save the row, the column, the contents and a timestamp. What Bigtable gives
us in this large setup is atomic read-modify-write operations for each row. And of course, each row
can have a large number of columns. Bigtable allows us to update all of the columns of a row
simultaneously and atomically. And there are separate columns for regular data and metadata. So,
this is the basic framework that Google Percolator is built on top of.
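To make the data model concrete, here is a minimal in-memory sketch of a table whose cells are addressed by row, column and timestamp, with an atomic read-modify-write per row. It is not the Bigtable API; the class `TinyTable`, its method names and the use of one `threading.Lock` per row are assumptions made purely for illustration.

```python
import threading
from collections import defaultdict


class TinyTable:
    """Toy table: (row, column) -> {timestamp: value}, atomic per row."""

    def __init__(self):
        self.cells = defaultdict(dict)                # (row, column) -> {ts: value}
        self.row_locks = defaultdict(threading.Lock)  # one lock per row key

    def read_latest(self, row, column):
        """Return (timestamp, value) of the newest version, or None."""
        versions = self.cells.get((row, column))
        if not versions:
            return None
        ts = max(versions)
        return ts, versions[ts]

    def mutate_row(self, row, ts, update):
        """Atomically read-modify-write several columns of one row.

        `update` receives {column: latest value} for the row and returns
        {column: new value}; the new values are stored at timestamp `ts`.
        """
        with self.row_locks[row]:
            current = {
                col: versions[max(versions)]
                for (r, col), versions in self.cells.items()
                if r == row and versions
            }
            for col, value in update(current).items():
                self.cells[(row, col)][ts] = value


table = TinyTable()
# Write two columns of the same row in one atomic step.
table.mutate_row("page1", 1, lambda cur: {"rank": 10, "anchor": "home"})
# Read the current rank and double it, again atomically for this row.
table.mutate_row("page1", 2, lambda cur: {"rank": cur["rank"] * 2})
print(table.read_latest("page1", "rank"), table.read_latest("page1", "anchor"))
```

Older versions are kept next to the new ones, which is what later allows a transaction to read the data as of a particular timestamp.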


So, if I were to explain this slightly better, this is how it would look. Let me go back to our
notes over here and go to a new page. Essentially, the core technologies that Percolator uses are
Google Bigtable and the Google File System. The Google File System is a distributed file system
along the lines of the Coda file system that we have studied. So, I am not discussing that.


So, the Bigtable, as I said, is a huge table with a very, very large number of rows, say 10^n,
where n is a large number. For each of them we also have a lot of columns. Every row is uniquely
identified by a key, and the value is of course the contents of the row. So, we can think of
Bigtable as essentially a key value store. And we can use a regular DHT based mechanism to store
the entries.


So, what Bigtable does is that it partitions this table into a set of tablets. So, the tablets can either 
be row wise or they can be rectangular grids like this, so where you would have multiple rows and 
columns. So, the Bigtable is pretty much broken down into a set of tablets. And we have dedicated 
tablet servers whose job is to supply the values of these tablets. Now, the question is, how are these 


tablets stored? 


Well, the tablets are stored as regular flat files in a file system. So, essentially, percolator is built 
on Bigtable. So, if I were to draw the diagram, Google percolator is built on Bigtable, which is 
also a Google system. And there is a beautiful paper that describes it. So, what is Bigtable? Well, 
it is a large multi-dimensional database that you can think of as a key value store, and we can
divide it into large sections in many ways. A section can be a set of multiple rows, or it can be a
grid, the way we have divided it here. Each such section is known as a tablet.
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So, we have dedicated tablet servers that provide these tablets. And each of these tablets is stored 
as a flat file in the Google File System, or GFS. GFS actually stores something called chunks, but
in this case each chunk can be thought of as a single tablet.


And the Google File System provides all kinds of guarantees. For example, it provides reliability
guarantees, that you will always get your data. It provides guarantees of availability, that a
certain set of files, which in this case are certain tablets, is always available. And it has many
other useful properties, such as high throughput, low latency and so on.


So, GFS is used to implement Bigtable, which is used to implement Percolator. You can think of
Percolator as a sophisticated library on top of Bigtable. We will see how. Before a reader asks why
we do not just use Bigtable: the reason is that Bigtable is essentially a very flat, DHT-like
system. It does not have some of the additional features that Percolator requires. So, what are
these features that Percolator requires?


First, it needs multi-row transactions. Why is this? Well, if we update the page rank of one page
and then we want to update the page ranks of all the pages it points to, they might be on other
rows of the Bigtable. We would like to update all of them together in one go. This is a multi-row
transaction.


This would of course have atomicity as one of its requirements; it is something that comes along
with multi-row transactions. Second, we need an observer framework, which essentially means that
if the page rank of one page changes, we automatically go to all the pages it points to and update
their page ranks. So, the question is, do we do it immediately or do we do it later?


So, in general, we defer this. We do it whenever we have the computational bandwidth, which
means that there has to be a dedicated process called an observer that looks for changes to pages.
It stores that information somewhere and then it comes back and does the updates; essentially it
does the job of percolation. So, let us go back.
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So, as we just discussed, any row, any entry has a set of observers, and whenever the data changes, 
which means the change in content, the change in anchor text, the change in the page rank, 


specialized functions are run to propagate or percolate these changes. 
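The deferred style of observation can be sketched as a table that merely records which cells changed and runs the registered functions later. This is a toy single-process model; the class `ObservedTable`, the callback signature and the 0.4 factor are assumptions for illustration and do not reflect Percolator's actual observer API.

```python
from collections import deque


class ObservedTable:
    """Toy table that records changed cells and runs observers later."""

    def __init__(self):
        self.data = {}          # (row, column) -> value
        self.observers = {}     # column -> list of callbacks
        self.pending = deque()  # changed cells waiting for their observers

    def register(self, column, callback):
        self.observers.setdefault(column, []).append(callback)

    def write(self, row, column, value):
        self.data[(row, column)] = value
        if column in self.observers:
            self.pending.append((row, column))  # defer; do not run now

    def run_observers(self):
        """Drain the queue; observers may write and so enqueue more work."""
        runs = 0
        while self.pending:
            row, column = self.pending.popleft()
            for callback in self.observers[column]:
                callback(self, row, self.data[(row, column)])
                runs += 1
        return runs


# Hypothetical link graph: p points to q, q points nowhere.
links = {"p": ["q"], "q": []}


def push_rank(table, row, value):
    # Pass a reduced share of the change on to every page `row` points to.
    for target in links[row]:
        table.write(target, "rank", value * 0.4)


table = ObservedTable()
table.register("rank", push_rank)
table.write("p", "rank", 5.0)
deferred = ("q", "rank") not in table.data  # nothing has propagated yet
runs = table.run_observers()                # later, when there is bandwidth
print(deferred, runs, table.data[("q", "rank")])
```

The write to `p` returns immediately; the change reaches `q` only when `run_observers` is called, which models doing the percolation whenever there is spare capacity.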
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So, of course, ACID transactions are hard to do in such a large database. So, we do not want to 
leave the database inconsistent. That is the reason we use a timestamp for each data item. At the
beginning of a transaction, the set of timestamps of all the data items that are being modified or
accessed can be thought of as the snapshot of the transaction.


So, in this case, transactions involve multiple rows, as we have just discussed. So, there is a
need to implement snapshot isolation. To implement snapshot isolation semantics, we would need
regular locks and, along with that, timestamps.
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So, let us now look at, so what is percolator again. Well, it is an instantiation of Bigtable and in 
that we add a few additional columns. We add a lock column that contains a pointer to the lock
(we will see how), a write column that holds the timestamp of committed data, the data column that
holds the data value, the list of observers, and the last time at which a certain observer O ran.
At this point it is not necessary to memorize the contents of this table. We will be coming back
to this table over and over again.
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So, let us look at a simple example. I was able to get the rupee symbol. And so the Indian rupee 


symbol is available in LaTeX now, so it can be used, which I have done. So, now, let us take
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a look at a sample transaction where I just transfer money. So, of course, this is not the same as 
percolator, but this gives you an idea of percolator’s logic. So, percolator uses exactly the same 


logic. 


So, at the beginning, let us say at timestamp 5, for entry B, we have 2 rupees in B’s account. And
at timestamp 6 we have said that the final data resides at timestamp 5. So, this 5 points over
here. Similarly, A has 10 rupees, and we also record the fact that the final data is at timestamp 5,
which means A has 10 rupees. So, we are using the key column, which is the ID of the agent, A or B,
and the data column, which stores the actual contents.


So, instead of rupees, in the case of Percolator the data will actually be the contents of the
links, the anchor texts and so on. There is a lock column, which we are not using right now, but we
will. And there is a write column that says where the final value lies; this is used for committing
a transaction. It is a very useful gadget, and it is also what we use to differentiate between a
temporary write and a permanent write.


So, now let us assume A would like to transfer 7 rupees to B. In this case what we do is create a
new entry at timestamp 7, where we decrement 10 to 3. Furthermore, we acquire a lock on this row.
We say that this is a primary lock, because A is starting the transaction, so it is the primary.
So, we acquire a lock at 7 and we say that 7 is a primary.


But mind you, here is the important point. We do not commit the value. So, we still say that the 
final data item is at 5, which is 10. 3 is just a temporary write. It is not committed. And this is 


where this column is useful. 
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Then what we do is create a new timestamped entry for B and also acquire a lock on B’s
row, but we say that the primary owner of the lock is A, because this is the secondary part of the
transaction. In this case, we credit the 7 rupees to B, so from 2 it becomes 9. Again, we do not
commit the transaction.


Now, we are in a position to commit the transaction because we have acquired the locks. This is
very similar to 2-phase commit, where we first acquire a kind of initial commitment, which is
similar to a lock, and then we do the final commit.
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So, in this case what we do is for the final write, we let go of the primary lock at A and we say that 
the data is at 7, which means that the final data of A is 3 rupees. Then we come to B, which
still has its lock, and we get rid of B’s lock. So, as you can see, there is no longer a lock on B,
and we say that the final data is at timestamp 7, which is 9 rupees.


So, this completes the process of a transaction. What we will do while updating the web index is
very similar. Instead of money we store web indices; we also acquire locks on multiple rows, make
the changes and then do the commit, which is a 2-phase commit mechanism that works in this
particular manner.
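The transfer example can be replayed in code using three per-row columns: data, lock (the column that records who holds the lock) and write. This is a simplified, single-threaded sketch of the logic described above, not Percolator itself. The names `Row`, `prewrite_transfer` and `commit_transfer` are invented, and recording the commit at timestamp 8 is an assumption that follows the pattern of the initial state, where the write record at timestamp 6 points to the data at timestamp 5.

```python
class Row:
    def __init__(self):
        self.data = {}   # timestamp -> value (possibly not yet committed)
        self.lock = {}   # timestamp -> name of the primary row
        self.write = {}  # commit timestamp -> timestamp of the committed data

    def committed_value(self):
        """Follow the newest write record to the committed data."""
        if not self.write:
            return None
        return self.data[self.write[max(self.write)]]


# Initial state from the example: data at timestamp 5, write record at 6.
table = {"A": Row(), "B": Row()}
for name, balance in (("A", 10), ("B", 2)):
    table[name].data[5] = balance
    table[name].write[6] = 5


def prewrite_transfer(table, src, dst, amount, start_ts):
    """Phase 1: tentative writes plus locks; `src` acts as the primary."""
    if table[src].lock or table[dst].lock:
        return False  # somebody else is in the middle of a transaction
    new_src = table[src].committed_value() - amount
    new_dst = table[dst].committed_value() + amount
    for name, value in ((src, new_src), (dst, new_dst)):
        table[name].data[start_ts] = value  # temporary write, not committed
        table[name].lock[start_ts] = src    # lock records the primary owner
    return True


def commit_transfer(table, src, dst, start_ts, commit_ts):
    """Phase 2: publish the data and release locks, primary first."""
    for name in (src, dst):
        table[name].write[commit_ts] = start_ts
        del table[name].lock[start_ts]


ok = prewrite_transfer(table, "A", "B", amount=7, start_ts=7)
# After phase 1 the committed balances are still the old ones.
before = (table["A"].committed_value(), table["B"].committed_value())
commit_transfer(table, "A", "B", start_ts=7, commit_ts=8)
after = (table["A"].committed_value(), table["B"].committed_value())
print(ok, before, after)
```

Between the two phases the new balances already sit in the data column, but readers that follow the write column still see 10 and 2, which is exactly the distinction between a temporary write and a permanent write.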
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Now, the exact algorithm, so the exact algorithm works like this that when we begin a transaction, 
we assume that there is one oracle machine in the system. Traditionally, an oracle is an entity
that knows everything. Well, in this case, it does not. What it does is provide a global timestamp
that increases monotonically. So, before I start a transaction I acquire a start timestamp. Then,
for all my writes, I just push the writes into a temporary buffer.
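A minimal sketch of these two pieces, a source of strictly increasing timestamps and a transaction that buffers its writes, might look as follows. The class names `TimestampOracle` and `BufferedTransaction` are invented, and a real oracle would be a separate, replicated service rather than an in-process counter.

```python
import itertools
import threading


class TimestampOracle:
    """Hands out strictly increasing timestamps, one per request."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._lock = threading.Lock()

    def next_timestamp(self):
        # The lock keeps the sequence strictly increasing across threads.
        with self._lock:
            return next(self._counter)


class BufferedTransaction:
    """Takes a start timestamp and buffers writes until commit time."""

    def __init__(self, oracle):
        self.start_ts = oracle.next_timestamp()
        self.writes = []  # (row, column, value) tuples, not yet applied

    def set(self, row, column, value):
        self.writes.append((row, column, value))


oracle = TimestampOracle()
txn = BufferedTransaction(oracle)
txn.set("A", "balance", 3)
txn.set("B", "balance", 9)
later_ts = oracle.next_timestamp()
print(txn.start_ts, txn.writes, later_ts)
```

Nothing is written to the table at this point; the buffered writes are only applied when the transaction goes through its two commit phases.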
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So, then what I do is that I, for my get operation, which is getting a row and a column, so for 
getting the contents of a row and a column, what do I do? Well, I start a while loop, and I start a
transaction on the row. If, between 0 and startTs, which is when I started my transaction,
somebody has acquired a lock, which means there is an outstanding lock, then of course there is a
problem: this row cannot be read, because it is currently locked. So, I back off and maybe
remove the lock.


We will now discuss why I may remove the lock. It is possible that the process that locked this
row has died. So, I wait for some time, and if I still get a feeling that the process that should
have removed this lock has died, I try to forcibly remove the lock. We will see that this does not
cause problems, and I would refer the viewers to the paper, where this issue is discussed.


But what is the main idea? The main idea is that nobody should have locked this row. If somebody
has, I back off and I try again. And if I get a feeling that this process has died, I try to
forcibly remove the lock. Otherwise, I do not. Then I find the latest write: I do a T.read to find
the latest write for this row between 0 and startTs.


So, if there is no latest write, I return null, which means that this row has not been written to.
Otherwise, I find the timestamp of the latest write. And what do I return? Well, I read the row
and return the data along with the timestamp of the data.
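The read path just described can be sketched as below. This is a simplified, single-threaded illustration, not the code from the paper: the `Row` class, the `lock_owner_is_dead` callback and the retry parameters are assumptions, and the forcible clean-up here simply erases the lock, whereas the paper describes a more careful procedure.

```python
import time


class Row:
    def __init__(self):
        self.data = {}   # start timestamp -> value
        self.lock = {}   # start timestamp -> primary row (uncommitted writer)
        self.write = {}  # commit timestamp -> start timestamp of the data


def get(row, start_ts, max_attempts=3, backoff_seconds=0.0,
        lock_owner_is_dead=lambda ts: False):
    """Read the newest committed value visible at `start_ts`.

    Returns (data_ts, value), or None if the row was never written.
    Raises RuntimeError if a lock stays in the way on every attempt.
    """
    for _ in range(max_attempts):
        # Any lock in [0, start_ts] may hide a write we are supposed to see.
        blocking = [ts for ts in row.lock if ts <= start_ts]
        if blocking:
            for ts in blocking:
                if lock_owner_is_dead(ts):
                    del row.lock[ts]     # forcibly remove a stale lock
            time.sleep(backoff_seconds)  # back off, then try again
            continue
        # Find the latest commit record in [0, start_ts].
        commits = [ts for ts in row.write if ts <= start_ts]
        if not commits:
            return None                  # the row has not been written to
        data_ts = row.write[max(commits)]
        return data_ts, row.data[data_ts]
    raise RuntimeError("row is still locked")


row = Row()
row.data[5] = 10
row.write[6] = 5
plain = get(row, start_ts=7)

# A writer prewrites at timestamp 7 and then dies without committing.
row.data[7] = 3
row.lock[7] = "A"
older = get(row, start_ts=6)  # the lock lies after this snapshot
cleaned = get(row, start_ts=9, lock_owner_is_dead=lambda ts: True)
print(plain, older, cleaned)
```

A reader whose snapshot predates the lock is not affected by it, while a later reader first has to get the stale lock out of the way and then still sees only the committed value.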
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So, while writing I have two stages, one is called pre-writing and writing similar to 2-phase 
commit. So, in this case, I first find the column that I want to write to. I start a transaction
on the row. Then, for the given row, I see if anybody has already written between startTs and
infinity, which means that some other transaction has performed a parallel write after I started.
If there is such a parallel write, I return false.


So, to understand this, just go back to this slide. Recall that for every row, we just have a set
of data with different timestamps. This is exactly what I am checking over here: after I started
my transaction, that is, between my start timestamp and infinity, if there is a parallel write, then
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clearly my pre-write will not be successful. So, we return a false. I also see that if anybody has 
acquired a lock, if there is any outstanding lock, then also I return false, because I will not be able 


to do the write. 


If either of these conditions holds, I exit this function with false. Otherwise, I remain in this
function. And then what do I write? Well, in the data field I write the value at startTs, which is
my transaction's starting timestamp. In addition, I write the lock at the starting timestamp, along
with the row and column of the primary. One of the rows is the initiator of the transaction, so I
write its ID as the primary lock owner. And then the commit function is called, which is the second
phase of 2-phase commit.
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So, in commit what I do is I first pre-write all the entries. So, for the both the primary and the 
secondaries, I pre-write all the entries, which means I take all the writes and I try to pre-write
them. If I cannot pre-write the primary, then of course I return false. If I cannot pre-write any
of the secondaries, then also I return false. But if all of them work out, then I get a commit
timestamp from the oracle and start the commit process.
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Design Algorithm 
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Let us now look at the second phase of the commit process which is after all the pre-writes have 
been performed. In this case, we commit the primary first. To commit the primary, we start a
transaction on the row of the primary. Kindly understand that between the start of the transaction
and the current point, which is line number 12, it is possible that some other transaction might
have aborted this transaction. Why would this be the case?


Well, the reason is on this slide, where we take a lock for a row. Let us say we take a lock for a
row and the current transaction holds it for a long time. Other transactions might perceive that
this row, which is the row of the primary, will never get unlocked, because the transaction that
was supposed to unlock it has died. So, in this case, we need to check if the transaction is still
alive, which means checking whether it has been aborted by somebody else or not.


So, the way we do it is that in the primary row, we check if at the startTs time instant, the lock is 
still there or not. If it is there, if we can read a lock, then well everything is good. If we cannot read 
a lock, then it means that the transaction has been aborted. So, we return false. So, we do not 
proceed with the commit. Otherwise, if the lock is there, this means that the transaction has not 


been aborted, so we can proceed and commit. 


So, what we do is that we write the primary and erase the lock. In the primary's row, we make an
entry at the commit timestamp that basically says that the data is at startTs. So if you
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would recall, the way that we were actually committing data is essentially making a record in the 
commit column, in the final data column, that the data is, the data exists at a given timestamp, and 
this was the process of committing, if you would go back to the table that we were showing where 
A transfers money to B. So, the moment we make this entry that the final data resides at timestamp 


startTs, this is akin to committing, so this is what we do. 


So, in one transaction, we commit the data as well as erase the lock. We remove the lock on the
primary row, and this process is the commit process. We would see many of these expressions at
different points, so I should have maybe explained it; let me do it right now. We are seeing
T.commit over here, and we are also seeing another T.commit over here. Note that these are
Bigtable transactions; these are not Percolator transactions.


So, the Percolator transaction is for these high-level commit and pre-write messages. But the
transactions that you see here, which is the start transaction over here and this commit, are
Bigtable transactions. Essentially, this is something Bigtable takes care of; we do not. So, in
this case, once we have written everything that we need to write, the Bigtable transaction has to
be committed, which is akin to doing both of these things atomically: setting the write as well as
the lock for the same row.


So, recall that Bigtable provides transactional access to single rows. In this case, what we are
doing is committing the Bigtable transaction with T.commit(). If we cannot do it, we return false;
else we continue.
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So, we do the same. So, once the primary has been committed, so there is no question of not 
finishing the commit. Now, for all the secondary rows, recall that Bigtable does not support
multi-row transactions, but Percolator does. So, what we will do is use Bigtable's basic
transactional infrastructure for single-row transactions, which we saw over here, starting a
Bigtable transaction and committing it, and do the same for the secondaries.


So, in this case, for each of the secondaries, we make the write entry in its row, which performs
the commit by saying that the data is at this timestamp. And furthermore, we erase the
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lock that the secondaries would have acquired. So, what do we do? We first look at the primary 
and then we commit the data of the primary and we then erase the lock. And then we come to the 


secondaries. And for each of the secondaries, we erase the lock. 
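The pre-write and commit phases described above can be put together in a small in-memory sketch. The classes `Table` and `Txn`, the cell layout and the counter that stands in for the timestamp oracle are all assumptions made for illustration; in particular, each Bigtable single-row transaction is modelled here as plain sequential Python code, and locks left behind by a failed pre-write are not cleaned up.

```python
import itertools


class Table:
    """Toy store: key -> {"data", "lock", "write"} version maps."""

    def __init__(self):
        self.cells = {}

    def cell(self, key):
        return self.cells.setdefault(key, {"data": {}, "lock": {}, "write": {}})


_clock = itertools.count(1)  # hypothetical stand-in for the timestamp oracle


def next_ts():
    return next(_clock)


class Txn:
    def __init__(self, table):
        self.table = table
        self.start_ts = next_ts()
        self.writes = []  # buffered (key, value); the first one is the primary

    def set(self, key, value):
        self.writes.append((key, value))

    def _prewrite(self, key, value, primary_key):
        c = self.table.cell(key)
        # Conflict: somebody committed a write at or after our start.
        if any(ts >= self.start_ts for ts in c["write"]):
            return False
        # Conflict: any outstanding lock, at any timestamp.
        if c["lock"]:
            return False
        c["data"][self.start_ts] = value
        c["lock"][self.start_ts] = primary_key  # the lock points to the primary
        return True

    def commit(self):
        if not self.writes:
            return True
        primary_key = self.writes[0][0]
        for key, value in self.writes:
            if not self._prewrite(key, value, primary_key):
                return False
        commit_ts = next_ts()
        p = self.table.cell(primary_key)
        # If our lock has vanished, another transaction has aborted us.
        if self.start_ts not in p["lock"]:
            return False
        p["write"][commit_ts] = self.start_ts  # commit point: data is at start_ts
        del p["lock"][self.start_ts]
        for key, _ in self.writes[1:]:  # secondaries
            c = self.table.cell(key)
            c["write"][commit_ts] = self.start_ts
            del c["lock"][self.start_ts]
        return True


table = Table()
t1 = Txn(table)
t1.set(("A", "bal"), 3)
t1.set(("B", "bal"), 9)
t2 = Txn(table)  # starts before t1 commits and writes the same cell
t2.set(("A", "bal"), 100)
print(t1.commit(), t2.commit())
```

In this run the first transaction commits and the second one fails its pre-write, because a write was committed to the same cell after it started.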


So, let us assume that the thread that is doing this fails at some point over here. Well, that is not 
an issue. The reason that that is not an issue is basically because if we take a look at the lock that 
a secondary holds, so essentially the lock is of this form, it points to the primary. So, I can give an 
example with the code we had here. So, if you can see the lock is basically saying that the primary 


is at A. 


So, just in case any of these threads fails in the middle, we will at least get to know who holds
the primary lock. We can go to the primary. If it still holds the lock, we will know that the
primary is not done yet. If it does not hold the lock, we will know that the primary is done.
From the primary we can get the details of the transaction, and the same can be applied at the
secondary just in case some work is remaining. After that, we can finish the job of the secondary.
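The recovery idea above, where a stale lock on a secondary is resolved by consulting the primary, can be sketched as follows. The function name `resolve_secondary_lock` and the dictionary layout are assumptions for illustration. The roll-back branch (primary lock gone and no commit record found) follows the approach described in the Percolator paper and is not spelled out in the lecture.

```python
def resolve_secondary_lock(cells, key):
    """Clean up a stale lock on a secondary cell by looking at its primary."""
    cell = cells[key]
    if not cell["lock"]:
        return "no lock"
    start_ts, primary_key = next(iter(cell["lock"].items()))
    primary = cells[primary_key]
    if start_ts in primary["lock"]:
        # The primary still holds its lock: the transaction is not done yet.
        return "pending"
    # The primary lock is gone. Look for its commit record (commit_ts -> start_ts).
    commit_ts = next(
        (c for c, s in primary["write"].items() if s == start_ts), None
    )
    del cell["lock"][start_ts]
    if commit_ts is None:
        # No commit record: the transaction was aborted, so undo the pre-write.
        cell["data"].pop(start_ts, None)
        return "rolled back"
    # The primary committed: finish the remaining work on the secondary.
    cell["write"][commit_ts] = start_ts
    return "rolled forward"


cells = {
    "P": {"data": {1: "x"}, "lock": {}, "write": {3: 1}},
    "S": {"data": {1: "y"}, "lock": {1: "P"}, "write": {}},
}
print(resolve_secondary_lock(cells, "S"))
```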
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So, a few details and optimizations, well, so the central point for us was the timestamp Oracle, 
which needs to sustain a very high throughput, because all the nodes essentially ask the Oracle for 
a timestamp which is a global value, all of them ask it for a timestamp. So, this is of course slow. 
But one important point that needs to be kept in mind is that latency is not really a big concern for 


us in the design of percolator, but throughput definitely is. 


So, one thing we can do is batch several RPC calls from the same node: instead of sending one
request at a time, we create a batch of several requests and send it to the oracle to reduce the
overall network load. And if the oracle, let us say, fails, then it needs to recover, but at least
in some non-volatile storage it needs to store the last timestamp that it had issued. Any
subsequent timestamp needs to be more than the timestamps that have been issued in the past.
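A timestamp oracle with these two properties, batched requests and recovery from non-volatile storage, can be sketched as below. This is only an illustration under stated assumptions: a plain dictionary stands in for stable storage, and the oracle persists a reserved upper bound (the `block` parameter, which is hypothetical) rather than every issued timestamp, so that a restart can safely resume above anything that may have been handed out.

```python
class TimestampOracle:
    """Issues strictly increasing timestamps and survives restarts."""

    def __init__(self, storage, block=1000):
        self.storage = storage  # dict standing in for non-volatile storage
        self.block = block      # how many timestamps to reserve per disk write
        # After a restart, resume above everything that might have been issued.
        self.limit = storage.get("max_allocated", 0)
        self.next_ts = self.limit + 1

    def get_timestamps(self, n=1):
        """Serve a batch of n timestamps (one RPC in a real deployment)."""
        while self.next_ts + n - 1 > self.limit:
            self.limit += self.block
            self.storage["max_allocated"] = self.limit  # persist before issuing
        first = self.next_ts
        self.next_ts += n
        return list(range(first, first + n))


disk = {}
oracle = TimestampOracle(disk)
before_crash = oracle.get_timestamps(3)

# Simulate a crash: the in-memory state is lost, only `disk` survives.
oracle = TimestampOracle(disk)
after_crash = oracle.get_timestamps(2)
print(before_crash, after_crash)
```

Some timestamps are skipped after a restart, which is harmless: the only requirement stated in the lecture is that later timestamps are larger than earlier ones.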
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So, the main aim of having a percolator kind of system was to have observers in the sense that if 
one entry changes, then other entries will also change. So, we want changes to percolate. So, what 
happens is that each observer registers a set of columns and a function. So maybe this can be shown 
better by drawing a more elaborate diagram. So, if we take a look at a row in percolator. So, 


essentially, for every row ID we have a set of columns. 


So, it is possible that the data in any of these columns changes. If the data changes, then it
means that the changes have to percolate. So, we can define an observer thread or an observer
process, which essentially observes a few of these columns.
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So, this is an asynchronous process. So, what we do is that whenever we change a row, we have a 


certain notify column. 


So, in notify column we just set a bit from 0 to 1. So, once we set a bit from 0 to 1 in the notify 
column, what the observer process is going to do is that it will scan the entire percolator database 
till it finds a 1 in the notify column. So, then based on the columns that have changed, it will go 
and access a few more rows and make changes over there. It will go and access maybe this row 


over here, maybe this row over here and it will make changes. 


So, just to explain this process once again: whenever there is some change in the database, say
whenever a page rank changes, the process that changes the page rank accesses a row of Percolator.
Since it is crawling the entire web repository, it need not make changes to other websites
immediately. What it can do is just set 1 in the notify column such that an observer thread later
picks it up and makes the changes. Once the observer has atomically made the changes, it can set
the notify bit to 0. Furthermore, it can set an acknowledgment field with the timestamp at which
the observer ran.


So, the reason we need the timestamp is that we never want two concurrent observers to do the same
work, yet it is possible that two observer threads see the notify bit to be 1 and both run
concurrently. Of course, our system will not allow concurrent modifications because of snapshot
isolation: one of them will fail. Another quick mechanism of ensuring that we do not have this
wasted work is via the acknowledgment field. Via this field, we will get to know that an observer
has already run and has done its job. So, we will not try to run once again.


So, this is pretty much what is being shown on the slide. Each observer registers a set of columns 
and a function. The function gets invoked, if any of the columns get updated. So, here what we 
can do is we can collapse messages. So, it is just to reduce the amount of work. We need not do it 
for a single change. But for a set of changes we can run the observer once. And for each column, 


for each commit per column, we will just run one observer transaction. 


So, the way that this works is that the worker thread ultimately picks up this information and
runs the observer. If the latest timestamp of an observer, ack_O, is less than commit_Ts of the
update, then it means that we need to run the observer because it would not have seen the update.
But if the observer’s
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timestamp which is ack_O is greater than the commit timestamp, then it means that it has seen the 


update and it has done the job. 
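The notify bit and the acknowledgment check can be sketched in a few lines. The row layout (`notify`, `commit_ts`, `ack_ts`) and the function names are assumptions for illustration. Unlike the real system, this sketch does not run the observer and the acknowledgment update inside one transaction.

```python
def needs_observer_run(ack_ts, commit_ts):
    """True when the last observer run is older than the latest commit."""
    return ack_ts is None or ack_ts < commit_ts


def run_pending_observers(rows, observer, now_ts):
    """Scan all rows, run the observer where needed, then clear notify bits."""
    ran = []
    for row_id, row in rows.items():
        if not row["notify"]:
            continue
        if needs_observer_run(row.get("ack_ts"), row["commit_ts"]):
            observer(row_id, row)
            row["ack_ts"] = now_ts  # record when the observer ran
            ran.append(row_id)
        row["notify"] = 0
    return ran


rows = {
    "a": {"notify": 1, "commit_ts": 10, "ack_ts": 5},     # stale: must run
    "b": {"notify": 1, "commit_ts": 10, "ack_ts": 12},    # already handled
    "c": {"notify": 0, "commit_ts": 3, "ack_ts": None},   # not dirty
}
ran = run_pending_observers(rows, lambda row_id, row: None, now_ts=20)
print(ran)
```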


So, what the authors actually found is that the moment we have many concurrent observer threads,
most of them start clumping in similar areas, which means that there is a lot of contention at
some parts of the database while large parts of it are left unvisited. That is the reason they
suggested that different threads can start at random parts of the database and simply start
scanning from there. Whenever they see a notify bit to be 1, they will start propagating the
update, percolating the update. This will ensure that a large part of the database gets covered
and we do not have a lot of queuing at the same place.
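The random starting point can be illustrated with a small scanner. This is a sketch under the assumption that the notify column is a simple list of bits; the function name `scan_from_random_start` is hypothetical.

```python
import random


def scan_from_random_start(notify_bits, rng=random):
    """Yield indices of dirty rows, scanning circularly from a random offset."""
    n = len(notify_bits)
    if n == 0:
        return
    start = rng.randrange(n)  # each worker picks its own starting row
    for step in range(n):
        i = (start + step) % n
        if notify_bits[i]:
            yield i


bits = [0, 1, 0, 0, 1, 1]
found = list(scan_from_random_start(bits, random.Random(0)))
print(sorted(found))
```

Every dirty row is still visited exactly once per scan; only the order differs between workers, which is what spreads them out over the table.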
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Performance improvement, so basically we need support for read modify write RPCs in Bigtable. 


This is something that Percolator introduced. RPC calls to Bigtable were also batched: if there
are multiple calls to the same region, then a single RPC call was made. Recall that an RPC is a
remote procedure call; I will not be discussing this here, because it was discussed in the past.
To reduce reads, particularly reads of large parts of the Bigtable tablets, some prefetching is
done.


So, this reduces the read time, and that too very significantly. There are numbers in the paper on
how much prefetching actually helps. Also, the programming model is simple, with a blocking API
and a large number of threads. So, Percolator as such is a very simple system to
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use, which Google has built primarily for its own applications, where they make some simple 
modifications to Bigtable and ensure that as compared to their previous system, which we will 
discuss very shortly, percolator provides a big advantage in terms of gradually evolving the 


database as the web changes. 


So, what are, again, some of the main Bigtable-specific modifications? Support for a
read-modify-write RPC, where a single RPC can atomically change the contents of a Bigtable row;
batches of RPC calls, which means that if there are multiple writes, instead of sending multiple
messages we send a single message; and prefetching for reads. This improves the performance
significantly.
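The batching idea can be sketched as grouping pending calls by their destination. The function `batch_by_region` and the `region_of` mapping are hypothetical; how rows map to regions in Bigtable is not modelled here.

```python
from collections import defaultdict


def batch_by_region(calls, region_of):
    """Group pending calls so that each region receives one batched RPC."""
    batches = defaultdict(list)
    for call in calls:
        batches[region_of(call["row"])].append(call)
    return dict(batches)


pending = [
    {"row": "a1", "value": 1},
    {"row": "a2", "value": 2},
    {"row": "b1", "value": 3},
]
# Assumption for the example: the first character of the row key is the region.
batches = batch_by_region(pending, region_of=lambda row: row[0])
print(len(pending), "calls become", len(batches), "RPCs")
```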
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4 Large-scale Incremental Processing Using Distributed 
Transactions and Notifications by Daniel Peng and Frank 
Dabek, OSDI, 2010. 
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So, the setup, well, the setup was that we crawl billions of documents. If you would go to the 
Percolator paper, it was published in 2010, so most of the results that are shown over here
predate 2010. I would say maybe 2009. In 2009 also, the web was massive; we had billions of
documents. The default baseline system that Google was using was map reduce. Map reduce is a
paradigm that has been there for quite some time. I will not be covering the details of it, but
viewers are requested to take a look at map reduce elsewhere.


So, there, the basic idea is that we take a large computation and break it into several small
parts, which is the map part. Then each of the processing elements does its computation, and
finally, all of the results are collated at a single point. We essentially take the different
computations and merge their results, and this is called the reduction step.


So, assume that we want to add a million numbers. The initial data set is the million numbers.
Then we break it down. If, let us say, we have 100 processors, then we break it down into 100
parts. Each processor gets 10,000 numbers and adds them up. Then what we do is simply add the 100
partial sums together. So, this is the reduce step.


So, the map reduce system worked, but it typically required many rounds of map reduce to process
the entire web repository, typically 50 to 100 rounds, and it took a long time for a document to
finally get indexed. What Google used to do in those days was that every 14 days, it used to
re-crawl the web. And then it used to run a full map reduce job to create an index for the web. So,
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this, and creating the index took two or three days. So, I recall way back in 2006-2007, once a 


website was created, it used to take more than 15 days for it to actually appear in the search results. 


So, this period got reduced substantially by the Percolator-based system, Caffeine, which is 100
times faster. In the Percolator system the main advantage is that we do not have to process the
entire repository. This is something map reduce had to do. We basically just process the nodes
that are affected. So, let us say a website changes its page rank; then we need to change the data
of only those other sites which are affected by this.


So, essentially, the time it takes is proportional to the number of websites whose page rank or
content or some attribute changes. That is why it is much faster, and also the average age of
documents in the search results gets reduced by 50 percent. So, we see far newer documents.
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Performance vs Crawl Rate 


- Crawl rate: percentage of the repository that is updated per
  hour.
- Let us plot the clustering latency (y axis) vs the crawl rate
  (x axis).
- For Map-reduce it starts at 2200 s and rises to infinity when
  the crawl rate exceeds 33%.
- For Percolator it remains below 200 s till about 37%. Then
  it continues to rise.




Let us now evaluate the performance. The crawl rate is defined as the percentage of the
repository that is updated per hour. Let us look at the clustering latency, which is the time it
takes to cluster similar documents, duplicates of documents, or different documents which have the
same content. As I said, given the content, we can then take all the documents in the cluster and
arrange them according to their page ranks. So, if this clustering latency is the y-axis and the
crawl rate is the x-axis, we will find that for map reduce it starts at 2200 seconds and rises to
infinity when the crawl rate exceeds 33 %.


So, even for a baseline map-reduce with a very low crawl rate, the latency starts at 2200 seconds
and it gradually rises to infinity, which means that after 33 % the graph for map-reduce shoots
up. But for Percolator, as long as the crawl rate is limited to 37 %, it remains very close to 0.
It remains roughly around 200 seconds, which is only 3 minutes and 20 seconds. That is kind of
negligible at web scale; it is genuinely very small. So, till around 37 % it remains the same, and
after that Percolator is not able to manage.


So, similar to map-reduce, the time that it takes also kind of explodes. This is a very important
and useful statistic. We will typically never go beyond 30 %: we will never update more than 30 %
of the repository per hour. That would be like saying that 30 % of the entire world wide web is
changing per hour, which is not the case. This means that most of the time we will remain in the
safe zone. And in the safe zone, as we can see, Percolator is roughly 10 times faster, if not
more, than a comparable map-reduce job.


And of course, map-reduce has many, many more overheads in terms of multiple iterations and so 
on and percolator does not have that, mainly because it only tracks those entries, those rows that 


are affected. 
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Scalability for TPC/E benchmarks 


- The transactions per second (TPS) varies linearly as we
  scale the number of cores.
- 4,000 TPS is achieved with 5,000 cores.
- It increases to 12,000 TPS for 15,000 cores.

Close to Linear Scaling
Sub-linear scaling: queuing, blocking


So, similar experiments were done with the popular TPC benchmarks for measuring the transactions
per second. This varies linearly, of course, as they scale the number of cores, which is always a
good thing. Linear scaling is always a piece of very good news, which means that we do not have
any bottlenecks. So, this was the case: 4,000 transactions per second were achieved at 5,000
cores, and this increased to roughly 12,000 for 15,000 cores.


So, we see linear scaling. When do we not see linear scaling, that is, when do we see sub-linear
scaling? Note that we will typically not see super-linear scaling, but we will see sub-linear
scaling particularly when there is some amount of queuing delay or some kind of a blocking delay.


So, there is essentially some place where there is some kind of a packet jam: a lot of network
packets are not able to flow through a given router. If that is the case, we will find that the
system is not scaling. So, the moment it scales linearly with the number of cores, we can be
fairly certain that there are no bottlenecks in the system.
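The linear-scaling claim can be checked with simple arithmetic on the two data points quoted in the lecture. The variable names below are only for illustration.

```python
# (cores, transactions per second) as quoted in the lecture
measurements = [(5_000, 4_000), (15_000, 12_000)]

per_core = [tps / cores for cores, tps in measurements]
core_ratio = measurements[1][0] / measurements[0][0]
tps_ratio = measurements[1][1] / measurements[0][1]

print("TPS per core:", per_core)
print("cores grew by", core_ratio, "and TPS grew by", tps_ratio)
```

The throughput per core stays the same at both points, and tripling the cores triples the throughput, which is what linear scaling means.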
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So, this was the description of our percolator system, the Google percolator system, which as we 
can see lies somewhere between a DHT and a traditional DBMS. Of course, DHTs have very little
structure, so to speak, other than the fact that they are just raw key-value stores. In a
traditional DBMS, there is a massive amount of structure: you have tables, you have relations
between tables, you have primary keys, you have foreign keys, you have a lot of relations.


But in this case, we have a single Bigtable. We just have relations between rows. And the Bigtable
itself is kind of structured as a DHT, but Percolator does not get to see most of that. So, it is
somewhere in the middle of the spectrum, which is kind of a good thing, because we get the
scalability of DHTs and we get the ACID semantics of traditional DBMS systems.
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Advanced Distributed System 
Professor Smruti R. Sarangi
Department of Computer Science and Engineering 
Indian Institute of Technology, Delhi 
Lecture 20 
Corona: Distributed publish-subscribe system 


In this lecture, we will describe the Corona publish-subscribe system. This is an extension of
what we have been studying with regard to distributed hash tables; in particular, it is an
extension of Pastry. So, we will see how to use it in this setting.


(Refer Slide Time: 0:37) 
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‘Smruti R. Sarangi Distributed Publish Subscribe Systems 


So, first we will describe the general structure of publish-subscribe systems. Then we will
discuss the design of Corona, and finally, we will discuss the evaluation results of Corona.
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Motivation Motivation 


Motivation

- The world wide web has a lot of information that changes
  frequently.
- There is no well defined publish-subscribe interface.
- Publish: Post updates, and send to all the subscribers.
- Subscribe: Register to get updates.
- Polling based methods are not efficient.
- Corona provides a scalable method to disseminate updates.


So, let us first describe the broad structure of a publish subscribe system. So, the broad structure 


is like this. 


(Refer Slide Time: 1:11) 
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+ Stock priced Publishers 
+ NEW feeds » Register 


So, in any publish-subscribe system, we have a set of publishers and we have a set of subscribers.
Let us say that this is a hypothetical space. All of the publishers publish topics of interest,
stories of interest, to the Pub Sub system. What could this be? Well, let us consider the stock
market. If I am a broker, I might be interested in, let us say, the stocks of 5 different
companies.
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So, then whenever the stocks whenever the stock of any company changes, so, some dedicated 
server can just publish the updates to the Pub Sub system. There might be a set of subscribers who
would be interested in a subset of the stocks. So, the moment that there is a change, a message
will go from the Pub Sub system to whichever subscribers are interested, telling them that there
is an update. In the case of a stock market, one of the stocks that you were interested in has
changed, so you would like to take a look at the new stock price.
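The publisher, subscriber and Pub Sub system roles described above can be sketched with a tiny in-process broker. The class `PubSub` and the topic names are assumptions for illustration; this is a generic topic-based sketch and not the design of Corona.

```python
from collections import defaultdict


class PubSub:
    """Minimal topic-based broker: subscribers register callbacks per topic."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, update):
        """Deliver the update to every subscriber of the topic."""
        callbacks = self.subscribers.get(topic, [])
        for callback in callbacks:
            callback(topic, update)
        return len(callbacks)  # number of subscribers notified


broker = PubSub()
inbox = []
# A broker interested in the stocks of two companies.
broker.subscribe("stock/ACME", lambda topic, update: inbox.append((topic, update)))
broker.subscribe("stock/INIT", lambda topic, update: inbox.append((topic, update)))

broker.publish("stock/ACME", 101.5)   # delivered
broker.publish("stock/OTHER", 7.0)    # nobody subscribed, nothing delivered
print(inbox)
```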


So, in general, what happens for subscribers is that they express interest in certain topics, and
this is notified to the Pub Sub system. Similarly, the publishers also first register with the
Pub Sub system and then they publish their updates. This can be anything: stock feeds, stock
prices, news feeds. One of the most common Pub Sub systems is called an RSS system.


So, you would see RSS or Atom links on popular news websites. In this case, what can happen is
that I can take an RSS feed reader, an aggregator for example. Then, whenever there is an RSS
update, that will be shown to me. So, whenever there is a new news story, that will be shown to me.
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So, Iam showing you the RSS page of Times of India is a very popular newspaper in India. 


So, if you see, the Times of India is publishing stories with respect to many topics: top
stories, India, world, business, cricket. If I want, I can subscribe to any of these topics. The
moment that there is a new story in, let us say, science, my dedicated RSS feed reader will show
me the update. So, let me click cricket, for example.
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My.Netscape.Com portal"! This version became known as RSS 0.9.! In July 1999, Dan Libby of Netscape produced a new version, RSS 0.91"! which simplified the 
format by removing RDF elements and incorporating elements from Dave Winer's news syndication format. Libby also renamed the format from RDF to RSS Rich
Site Summary and outlined further development of the format in a "futures document".

This would be Netscape's last participation in RSS development for eight years. As RSS was being embraced by web publishers who wanted their feeds to be used on
My.Netscape.Com and other early RSS portals, Netscape dropped RSS support from My.Netscape.Com in April 2001 during new owner AOL's restructuring of the
company, also removing documentation and tools that supported the format.

Two parties emerged to fill the void, with neither Netscape's help nor approval: The RSS-DEV Working Group and Dave Winer, whose UserLand Software had
published some of the first publishing tools outside Netscape that could read and write RSS.

Winer published a modified version of the RSS 0.91 specification on the UserLand website, covering how it was being used in his company's products, and claimed
copyright to the document. A few months later, UserLand filed a U.S. trademark registration for RSS, but failed to respond to a USPTO trademark examiner's request
and the request was rejected in December 2008.

The RSS-DEV Working Group, a project whose members included Guha and representatives of O'Reilly Media and Moreover, produced RSS 1.0 in December 2000.
This new version, which reclaimed the name RDF Site Summary from RSS 0.9, reintroduced support for RDF and added XML namespaces support, adopting elements
from standard metadata vocabularies such as Dublin Core.

In December 2000, Winer released RSS 0.92, a minor set of changes aside from the introduction of the enclosure element, which permitted audio files to be carried in
RSS feeds and helped spark podcasting. He also released drafts of RSS 0.93 and RSS 0.94 that were subsequently withdrawn.

In September 2002, Winer released a major new version of the format, RSS 2.0, that redubbed its initials Really Simple Syndication. RSS 2.0 removed the type attribute
added in the RSS 0.94 draft and added support for namespaces. To preserve backward compatibility with RSS 0.92, namespace support applies only to other content
included within an RSS 2.0 feed, not the RSS 2.0 elements themselves. (Although other standards such as Atom attempt to correct this limitation, RSS feeds are not
aggregated with other content often enough to shift the popularity from RSS to other formats having full namespace support.)

Because neither Winer nor the RSS-DEV Working Group had Netscape's involvement, they could not make an official claim on the RSS name or format. This has fueled
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In July 2003, Winer and UserLand Software 


ongoing controversy in the syndication development community as to which entity was the proper publisher of RSS.

One product of that contentious debate was the creation of an alternative syndication format, Atom, that began in June 2003. The Atom syndication format, whose
creation was in part motivated by a desire to get a clean start free of the issues surrounding RSS, has been adopted as IETF Proposed Standard RFC 4287
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| Motivation 


- The world wide web has a lot of information that changes
  frequently.
- There is no well defined publish-subscribe interface.
- Publish: Post updates, and send to all the subscribers.
- Subscribe: Register to get updates.
- Polling based methods are not efficient.
- Corona provides a scalable method to disseminate updates.
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So, what you see over here is a complicated, unreadable piece of text. But if I were to open the 
same in an RSS feed reader, then I would nicely see all of the updates in my feed reader. A quick
word about RSS, if I were to show you the Wikipedia entry: the full form of RSS is Really Simple
Syndication. It gives us summaries of websites, and it is one of the most popular Pub Sub systems.
The websites publish the news, and then we can subscribe to them.


And then I will immediately get notified the moment that there is a change in a website. Pub Sub
systems are not just limited to RSS, even though RSS is very common. Pub Sub systems are kind of
there everywhere, wherever we have a publisher that posts updates and a subscriber that registers
to get updates. We just saw an example with RSS, but of course, this can be in any kind of system.
So, how does this actually work? The way that this actually works is kind of interesting. Let us
go back to our journal over here.
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So, there are 2 methods by which this can work. The first is polling. In polling, the subscribers
keep on asking the publisher: do you have an update, do you have an update? The other is an
interrupt driven mechanism, where the subscriber registers her IP address with the Pub Sub system,
and whenever there is a change, the subscriber is notified. So, clearly, in an interrupt driven
system, the Pub Sub system has to do the work and it has to maintain state, whereas in a
polling-based system, the Pub Sub system does not maintain any state.
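The difference between the two methods can be sketched in a few lines of Python. This is only an illustration: the class names `PollingPublisher` and `PushPublisher`, their methods, and the use of a version counter are assumptions made for this sketch and do not come from any real Pub Sub system.

```python
from typing import Callable, Dict, Optional


class PollingPublisher:
    """Polling: the publisher keeps no per-subscriber state, it only answers questions."""

    def __init__(self) -> None:
        self.version = 0
        self.content = ""

    def publish(self, content: str) -> None:
        self.version += 1
        self.content = content

    def poll(self, last_seen_version: int) -> Optional[str]:
        # The subscriber remembers what it last saw; the publisher does not.
        if self.version > last_seen_version:
            return self.content
        return None


class PushPublisher:
    """Interrupt driven: the publisher maintains a table of registered subscribers."""

    def __init__(self) -> None:
        # Maps a subscriber address to a callback that delivers the update.
        self.subscribers: Dict[str, Callable[[str], None]] = {}

    def subscribe(self, address: str, callback: Callable[[str], None]) -> None:
        self.subscribers[address] = callback

    def publish(self, content: str) -> None:
        # The work of notifying every subscriber falls on the publisher.
        for notify in self.subscribers.values():
            notify(content)
```

In the polling version every subscriber has to keep calling `poll`, which is where the load on the server comes from; in the push version the load shows up as the per-subscriber state and the notification loop.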
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So, the specific motivation for Corona is this: polling-based methods are not very efficient,
because they place a lot of load on the server. Corona provides a very scalable and also very
flexible method to disseminate updates. Let us say, for example, that I am interested in sports
updates, and let us say there are a million people like me. All million of us should not be
polling the sports site to get updates. So, let me draw this over here: let us say there is a
sports site, a website, and there are millions of subscribers.


So, we have discussed 2 methods: polling and interrupt based. In the polling-based method, the
millions of subscribers go to the site and keep checking if there is some news or some change.
Clearly, this increases the load on the site. For example, most cricket lovers go to a site like
Cricinfo whenever a match is going on and we cannot actually see the match, and we pretty much
keep on pinging it to see if there is a new status update.


The other is that the server notifies every browser of each update, for which it has to maintain
state. In that case, if the state changes, then each of the client browsers needs to be notified.
So, this also places a large load on the server. The main aim of Corona is to place a layer of
intermediaries over here, which minimizes the load on the server and also stops the clients from
polling. The intermediaries do poll the server, but they do so in a way that minimizes the load at
the server.


And instead of the clients polling, the intermediaries send messages to their clients whenever
there is an update. So, the intermediaries maintain state. So, what is Corona once again? Well, if there
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is a very popular website and there is a large set of clients, then Corona is like a middleman
sitting over here, where each of the Corona nodes maintains state. It keeps track of the clients,
and whenever there is an update, it disseminates the update to the clients. The nodes themselves
keep polling the website, but they do it in such a way as to minimize the load on the website.


So, the website is happy, and of course the clients are happy, because they quickly get the
updates without increasing the load on the server, which could otherwise crash. The intermediate
layer, Corona, is what we need to study. So, this is the core idea: having some kind of an
intermediate layer, a middle layer, in a traditional Pub Sub system.
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So, what is Pub Sub used for? Well, it is used for content that changes frequently, and for
which we need to quickly know that the content has changed. These could be blogs, wikis, and news
sites; news, I would say, is the most common. Stock prices are also very common, and so are sports
score updates. In cricket, for example, if a match goes on for the entire day, then for the entire
day we are getting updates. The current solution is to use RSS feeds, as I just showed you on the
Times of India's website. This is called micro news syndication. There are now many more sites
that also give us short news briefs. Many of these sites, like Inshorts and so on, give us very
short news briefs, but all of these are based on very simple polling.
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So, there is clearly a tradeoff between the latency perceived at the client and the bandwidth at
the server, and both need to be taken care of. One aim is to minimize the server load; the other
is to minimize the latency of updates. The idea here is that the load on the server of course has
to be minimized, but the latency of updates for each client also has to be reduced.
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Motivation
Contemporary Approaches


Current Solutions


@ Content providers impose rate limits based on IP addresses,
range of IP addresses.
@ Servers ask the client to stop polling, or to change their
polling intervals.
@ Corona’s aim:
    @ Manage the server's bandwidth efficiently.
    @ Stay within the limits.
    @ Give the clients the best possible update latency.


Smruti R. Sarangi Distributed Publish Subscribe Systems


So, what are the contemporary approaches? Some content providers impose a rate limit on the
amount of bandwidth per IP address, on the number of IP addresses that can connect at the same
time, or on a range of IP addresses. Servers can also ask the client to stop polling or to change
their polling intervals. That is why many of the popular news sites, including Times of India or
anything else that has a large viewership,
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actually go via intermediaries. One of the most popular intermediaries is a provider called
Akamai.


So, what Akamai does is that it sits between some very popular websites and the clients. Whenever
a client browser points to one of these websites, it essentially hits an Akamai node, and the job
of the Akamai node is to collect updates from these websites and give them to the client browsers.
Of course, this process is hidden from us; as a client browser, we do not get to see it. But it is
still happening: Akamai sits in the middle and does it for us, which is getting the latest updates
from the server and giving them to the clients.


So, Corona's aim is something very similar. Of course, Akamai is a proprietary solution, whereas
Corona comes from academia. It manages the server's bandwidth efficiently, stays within limits,
and gives the clients the best possible update latency. For clients, you can think of this as
something that can be either polling or interrupt based, but the relationship between the content
aggregator or provider, like Akamai or Corona, and the website is a purely polling based
relationship.
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So, how did the Corona system, which was built way back in the mid-2000s, actually work? Users
subscribed by sending instant messages to a registered Corona user ID. Of course, instant
messaging is there everywhere. We have a lot of instant messaging protocols and platforms, like
Google Talk, Skype, and so on. We also have software called Gaim on Linux, which kind of
aggregates all of
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them. So, users send an instant message to a Corona user ID. Corona is a set of nodes on the
cloud that monitor a set of channels.


So, essentially, every website or every content provider is called a channel. Any service that
generates an active feed of data, an active feed of news, is called a channel. The Corona resource
allocation algorithm will dedicate a set of nodes in Corona to monitor a given channel. These
nodes will filter out useless content, which can include timestamps, advertisements, and other
things that clients do not need. Then a difference engine will extract the relevant portions that
have changed, and these differences, the new updates, will be distributed to the clients. So, this
is all that Corona does: it first polls the channel, it extracts the difference, and then it
distributes the difference to the clients.
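The three steps above (poll, extract the difference, distribute) can be sketched as follows. This is a minimal sketch and not Corona's actual difference engine: the function names, the `timestamp:` and `ad:` line markers used for filtering, and the line-based diff using Python's standard `difflib` module are all assumptions made for illustration.

```python
import difflib
import re
from typing import Callable, List


def filter_noise(page: str) -> List[str]:
    """Drop lines that change on every fetch but carry no news (assumed markers)."""
    noise = re.compile(r"^(timestamp:|ad:)", re.IGNORECASE)
    return [line for line in page.splitlines() if not noise.match(line)]


def extract_difference(old: List[str], new: List[str]) -> List[str]:
    """Return only the lines that were added since the previous poll."""
    diff = difflib.ndiff(old, new)
    return [line[2:] for line in diff if line.startswith("+ ")]


def poll_once(fetch: Callable[[], str], previous: List[str],
              clients: List[Callable[[List[str]], None]]) -> List[str]:
    """One round: poll the channel, filter, diff, and distribute the difference."""
    current = filter_noise(fetch())
    changes = extract_difference(previous, current)
    if changes:
        for deliver in clients:
            deliver(changes)
    return current  # becomes `previous` for the next round
```

Here `fetch` stands in for an HTTP request to the channel and each entry of `clients` stands in for sending an instant message; both are placeholders.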
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So, so much for the motivation; now we are well-motivated. Let us look at a few of the solutions
and related work, which goes something like this. We have publishers who post content, and we have
subscribers who subscribe to relevant content. This process of subscribing to relevant content can
be done in 2 ways. One is topic based, which means that publishers and subscribers are connected
by a set of topics, and each topic is a channel. The topic, for example, can be food: whenever
there is any update on food, it is given to the clients.


So, this is not exactly how Corona works. Corona is channel based, and in this case a channel is
a specific website or a specific provider. A topic, of course, can span across channels, in the
sense that multiple websites can be providing content on food. So, what Corona does, monitoring a
single channel, is a kind of subset of topic based. The second way is content based, where
subscribers make queries on the content. Here, the content is not organized by topics; it is just
raw content, and we essentially ask the system to go through all the content and find the relevant
items that need to be sent to the clients. Content based is not something that Corona does. What
it does is a subset of topic-based Pub Sub, where we get updates from a particular channel.


So, many of the research prototype Pub Sub systems require custom interfaces and they are
difficult to use. One great advantage of Corona was that it is compatible with all content
providers. It is also rather easy to use: it is based on instant messaging, and instant
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messaging is ubiquitous; it is available everywhere. Because of that, the Corona system is
extremely popular.
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So, we have already discussed RSS. What is it? It is short updates of frequently changing data,
which can be news stories, blogs, or blog posts. RSS feeds typically use an XML-based format to
share very short updates. We can access them via HTTP over standard URLs, as we just saw, and
there are dedicated RSS feed readers, or aggregators, to display the data. The server informs the
client when to poll and when not to poll by using a certain tag called the cloud XML tag. So, the
servers in RSS do some sort of rate control: if, let us say, there is excessive polling and
excessive load at the server, the servers control the rate. But in the case of Corona, this is not
required.
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So, in this example of an RSS XML snippet, we first mention the version of XML and the text
encoding, and then the RSS version, which is 2. Then comes a channel. The channel, of course, can
have a lot of things. In this case, let us say I updated my homepage. Then my homepage will come
here, along with the link that has been updated, the description of my homepage, and the
particular items that have been updated. Let us say I updated my teaching page and my research
page, and let us say I just changed the descriptions, for example.


So, both updates, the update for teaching as well as the update for research, show up over here
as separate items in the RSS snippet. And a dedicated RSS feed reader displays these items to the
reader.
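A snippet of the kind described above can be reconstructed and read with Python's standard `xml.etree.ElementTree` module. The snippet below is not the one shown on the slide; the titles, the `example.org` links, and the descriptions are made up for illustration, and only the overall shape (XML declaration, RSS version 2.0, one channel, two items) follows the description.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed: one channel (a homepage) with two updated items.
RSS_SNIPPET = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>My Homepage</title>
    <link>http://www.example.org/home</link>
    <description>Updates to my homepage</description>
    <item>
      <title>Teaching</title>
      <link>http://www.example.org/home/teaching</link>
      <description>Updated the teaching page</description>
    </item>
    <item>
      <title>Research</title>
      <link>http://www.example.org/home/research</link>
      <description>Updated the research page</description>
    </item>
  </channel>
</rss>
"""


def list_items(rss_text: str):
    """Return (title, link) for every item in the feed, as a feed reader would list them."""
    root = ET.fromstring(rss_text.encode("utf-8"))
    channel = root.find("channel")
    return [(item.findtext("title"), item.findtext("link"))
            for item in channel.findall("item")]


for title, link in list_items(RSS_SNIPPET):
    print(title, link)
```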
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Now we come to the design of Corona. Corona stands for Cornell Online News Aggregator, because
this was initially a project from Cornell University. The key features of Corona are as follows.
The first is cooperative polling, which means we assign multiple nodes to poll the same channel
and share updates.


So, there is a single channel and there are multiple nodes. They poll the same channel and share
the updates with a much wider base of clients. Of course, the number of nodes that we assign to
poll the same channel depends on the popularity of the channel, the nature of the content, the
size of the content, and how many clients are interested. This is posed as a global optimization
problem, and the Honeycomb optimization toolkit is used to solve it.
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Corona uses a Pastry-based overlay. Recall that in Pastry we had created a large circular space,
and we represented numbers in a given base. So, consider a channel. Let us say a channel is
uniquely identified by its URL. We hash the URL using the SHA algorithm, and this produces a hash
value. The hash value is not represented in binary, but in base b.


A small clarification over here: in the Pastry paper, the base is actually $2^b$. In the Corona
paper it is not $2^b$; the convention is that the base is b. So, we will go with the Corona
convention, even though in the Pastry paper the connotation of b was different. In Corona, the
base in which the hash is represented is b.
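To make the identifier computation concrete, here is a small sketch. The lecture only says "the SHA algorithm"; the use of SHA-1 (160 bits), the default base of 16, the 40-digit identifier length, and the function names are assumptions for this example.

```python
import hashlib
from typing import List


def to_digits(value: int, base: int, num_digits: int) -> List[int]:
    """Write `value` as exactly `num_digits` digits in `base`, most significant first.

    If `value` needs more digits than `num_digits`, only the low-order digits are kept.
    """
    digits = []
    for _ in range(num_digits):
        digits.append(value % base)
        value //= base
    return digits[::-1]


def identifier(key: str, base: int = 16, num_digits: int = 40) -> List[int]:
    """Hash a channel URL (or a node's IP address) to a position on the ring."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()  # 160-bit hash value
    return to_digits(int.from_bytes(digest, "big"), base, num_digits)


# Hypothetical channel URL; in base 16, 40 digits cover all 160 bits of the hash.
channel_id = identifier("http://www.example.org/feed.xml")
print(channel_id[:8])
```

The same function can be applied to a node's IP address, so that channels and nodes end up in the same identifier space.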


So, this hash is in base b, and for every such hash value we can find a position for it on the
ring. So, the channel can be positioned over here. The idea here is the same as in Pastry: the
node closest to the channel is the owner of the channel, and it is expected that the length of the
prefix match between the two will be $\log_b N$ digits, where N is the number of nodes. If that
does not happen, we will of course see what is done, but this is expected to be the common case.


So, what Corona does is that it defines a wedge around the channel, something like a pizza slice,
which means that it logically splits the nodes on the ring. On the circular ring, similar to
Pastry, it maps both the nodes as well as the channels:
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the channel's URL is hashed and mapped to the ring, and for the nodes, the IP address is hashed
and mapped to the ring.


So, once all the nodes and channels find their place on the ring, we can define a wedge. A wedge
is defined as a set of nodes that share a common number of prefix digits with the channel's
identifier. Also, if a channel has polling level $l$, this means that it is polled by all the
nodes that have at least $l$ matching prefix digits with it.
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So, let us explain this in some more detail, because this is by far the most important concept
in Corona. Both the channels as well as the nodes are mapped to the circle. Let us say that the
channel is mapped over here. Then one of the nodes will be the closest to the channel, and this
node is made the owner of the channel. Let us say it is this node over here.
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This node is the owner of the channel. It is expected that the degree of the prefix match will
be $\log_b N$. Of course, if that does not happen, we still find the closest node and we settle
for a shorter prefix match.
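Picking the owner can be sketched as a search for the node that is closest to the channel on the ring. For simplicity this sketch uses small integer identifiers and measures distance with wrap-around; the ring size of 256, the function names, and the example identifiers are assumptions and are not taken from the Corona or Pastry papers.

```python
from typing import Iterable


def ring_distance(a: int, b: int, ring_size: int) -> int:
    """Shortest distance between two positions on a circular identifier space."""
    d = abs(a - b) % ring_size
    return min(d, ring_size - d)


def owner(channel_id: int, node_ids: Iterable[int], ring_size: int) -> int:
    """The node whose identifier is closest to the channel's identifier."""
    return min(node_ids, key=lambda n: ring_distance(n, channel_id, ring_size))


nodes = [10, 100, 250]
print(owner(3, nodes, 256))    # node 10 is 7 away, node 250 is 9 away
print(owner(255, nodes, 256))  # node 250 is 5 away, node 10 is 11 away
```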


Now, around each channel we define a wedge; a wedge is like a pizza slice. With every wedge, we
associate a polling level $l$. A polling level of $l$ basically means that there are $l$ matching
prefix digits with the channel ID. Let us say that we are representing the channel ID in base 16,
something like this, and let us assume that by prefix we mean from left to right.


So, if this is the channel, then consider a given node that is a part of the wedge that is
monitoring the channel. I will come back to what monitoring the channel means, but let us look at
this for the time being. Let us say the node's ID diverges from the channel ID from here on. The
polling level is equal to the number of matching prefix digits. In this case, the number of
matching prefix digits is five, so the polling level is five.
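The prefix match in this example can be written down directly. The identifiers below are made-up base-16 strings chosen so that they agree on the first five digits, as in the example; the function names are also assumptions.

```python
def matching_prefix_digits(a: str, b: str) -> int:
    """Number of leading digits (left to right) on which two identifiers agree."""
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


def in_wedge(node_id: str, channel_id: str, polling_level: int) -> bool:
    """A node is in the wedge if it shares at least `polling_level` prefix digits."""
    return matching_prefix_digits(node_id, channel_id) >= polling_level


channel = "a3f9c2b7"  # hypothetical channel ID in base 16
node = "a3f9c7e1"     # agrees with the channel on "a3f9c", then diverges
print(matching_prefix_digits(node, channel))
print(in_wedge(node, channel, 5), in_wedge(node, channel, 6))
```

This node belongs to the wedge for any polling level up to five and drops out of it once the polling level is raised to six.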


So, we can always define a wedge based on the polling level: the lower the polling level, the
wider the wedge; the higher the polling level, the narrower the wedge. Once the polling level
crosses $\log_b N$, we expect the wedge to contain no nodes other than the owner. So, at the
beginning, the wedge will contain just the owner node of the channel, and the polling level is
expected to be $\log_b N$. But of course, if the channel is very popular, we would like a lot of
nodes to poll the channel. So, what we do is that we partition the hash space in Pastry by
defining a wide wedge. How do we do this? By reducing the polling level, such that we have a wide
wedge over here.


And the magic of Corona comes here. The key idea is that all the nodes in a wedge poll the
channel. So, if we want more nodes to poll the channel, all that we have to do is reduce the
polling level. Once we reduce the polling level, the wedge will span more nodes, and the increased
number of nodes will poll the channel. And if, let us say, the channel loses popularity, then all
that we need to do is increase the polling level, and the wedge will shrink.


So, let us say the wedge might become something like this, and then fewer nodes will poll the
channel. On similar lines, we can have different channels on different parts of the (())(32:34),
and for each of them we will have a wedge that is defined. Also note that in this case, wedges can
be overlapping.


So, in this case, let us say there is another channel over here, something like this; let me use
a different color. So, let us say the other channel is over here. Then I can define a wedge for
it, which would look something like this.


Kindly note that the 2 wedges are overlapping. Nodes in the overlapping region would poll both
channels. Corona does have a way of managing such conflicts in a very indirect manner, but we will
discuss that later, as long as the key concept is clear. The key concept over here is the concept
of wedges, which are essentially partitions of the hash space. We can have either a wide wedge or
a narrow wedge, depending on the polling level, which can be changed dynamically.
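How the polling level widens or narrows a wedge can be seen with a small experiment on randomly generated identifiers. The number of nodes (4096), the 8-digit base-16 identifiers, and the fixed random seed are arbitrary choices for this sketch; real identifiers would come from hashing.

```python
import random
from typing import List


def random_id(rng: random.Random, num_digits: int = 8) -> str:
    """A random base-16 identifier, standing in for a hashed IP address or URL."""
    return "".join(rng.choice("0123456789abcdef") for _ in range(num_digits))


def wedge(nodes: List[str], channel_id: str, polling_level: int) -> List[str]:
    """All nodes that share at least `polling_level` prefix digits with the channel."""
    prefix = channel_id[:polling_level]
    return [n for n in nodes if n.startswith(prefix)]


rng = random.Random(1)
nodes = [random_id(rng) for _ in range(4096)]
channel = random_id(rng)
for level in range(4):
    # Observed wedge size next to the value N / b^l expected on average.
    print(level, len(wedge(nodes, channel, level)), 4096 / 16 ** level)
```

At level 0 every node is in the wedge, and each increase in the polling level keeps only a subset of the previous wedge.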


So, the entire optimization problem in Corona is to figure out the wedge size for each channel.
That determines the server load and the load on the Corona nodes, and it also determines the
update latency seen by the clients. All of this is related to the wedge size. How? If the wedges
are very wide, this will mean that we are getting a lot of updates from the server, and the update
latency will be low.
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So, I can summarize this over here. A wide wedge has a low update latency, because a lot of
nodes are polling, and they can also poll at random intervals. With that, if there is any change,
it will quickly get caught by one of the nodes in the wedge, which can then distribute the update
to the rest of the nodes.


So, a wide wedge will have a low update latency and a high server load and, vice versa, a narrow
wedge will have a high update latency and a low server load. There are conflicting requirements of
update latency and server load, and we will see how Corona resolves them.
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that have at least l matching prefix digits with it.


@ A wedge associated with a channel polls for it. 




So, we understood the notion of a wedge in terms of the common number of prefix digits. A channel
has polling level $l$ if it is polled by all the nodes that have at least $l$ matching prefix
digits with it, which means all the nodes within the wedge.
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So, a little bit of math over here. Let the total number of nodes be $N$. A channel with polling
level $l$ will have, on average, $N/b^l$ nodes in its wedge. I am not proving this; instead, I am
directing the viewers to the Pastry paper, which proves this in great detail. That paper should be
read as a prerequisite for this lecture. So, we will take this result as a given here. Let $\tau$
be the polling interval, and let us give the average detection time for updates a name: det time.


So, the det time is clearly inversely proportional to the number of nodes in the wedge: the
higher the number of nodes in the wedge, the lower the detection time. Assuming that they are
polling at random intervals, which is the case, the detection time is proportional to $b^l/N$.
Furthermore, the detection time is also proportional to the average polling interval, which means
that if every node polls once every $\tau$ seconds, the detection time will be proportional to
$\tau$.


In addition, we are assuming a uniform distribution here: if a node polls at time 0 and again at
time $\tau$, the data could have changed at any point in between, so the average latency will be
$\tau/2$, which is easy to prove.
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So, this gives us the average detection time for updates: $\frac{\tau}{2} \cdot \frac{b^l}{N}$.
And the collective load placed on the server is clearly proportional to the size of the wedge,
which is $N/b^l$. So, the problem is to estimate the polling levels of each channel or,
alternatively, the sizes of the wedges.
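The two quantities derived above can be put into code to see the tradeoff numerically. The function names and the example values (1024 nodes, base 16, a polling interval of 1800 seconds) are assumptions for this sketch; the formulas are only meaningful while the expected wedge size is at least one node.

```python
def expected_wedge_size(n_nodes: int, base: int, level: int) -> float:
    """Average number of nodes in the wedge of a channel at polling level l: N / b^l."""
    return n_nodes / base ** level


def avg_detection_time(tau: float, n_nodes: int, base: int, level: int) -> float:
    """Average update detection time: (tau / 2) * (b^l / N)."""
    return (tau / 2) * base ** level / n_nodes


N, b, tau = 1024, 16, 1800.0
for level in (0, 1, 2):
    print(level, expected_wedge_size(N, b, level), avg_detection_time(tau, N, b, level))
```

Note that the product of the two quantities is always $\tau/2$: halving the detection time means doubling the number of nodes that poll the server.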
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So, let us now discuss a few of the Corona schemes. The first is Corona Lite, which is also the
first scheme that comes to mind when we think about this problem. Intuitively, it ensures good
update performance while ensuring that the load on the servers is light. Let us first look at the
terminology.


Let $M$ be the number of channels, $q_i$ the number of clients for channel $i$ (note that q is
for the number of clients), $l_i$ the polling level of channel $i$, and $N$ the total number of
nodes. $s_i$ is the content size for channel $i$, appropriately normalized. This is not something
that the paper stresses on, but it is important for me to stress that it has to be appropriately
normalized.


So, $s_i$ is the normalized content size for channel $i$. What are we trying to do here? We are
trying to minimize $\sum_{i=1}^{M} q_i \frac{b^{l_i}}{N}$. Let us see what this is. Let us just
go to the previous slide: $b^{l_i}/N$ is proportional to the update detection time, so we say that
this is the update detection time, and we are weighting it with the number of clients for
channel $i$.


So, in sum, we want to minimize this quantity, which is the weighted sum of the update detection
times, where the weight is the number of clients for channel $i$. What effect will this have? For
any channel that has a large number of clients, we would like to minimize the update time as much
as possible.
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So, essentially, for popular channels, we want to deliver updates as quickly as possible, and we
essentially penalize less popular channels. This is, of course, the objective function. What is
the constraint? The quantity $N/b^{l_i}$ is the expected size of the wedge, and $s_i$ is the size
of the content of the channel. If I multiply the content size with the size of the wedge, this
gives me an idea of the bandwidth required from the server to supply data to all the Corona nodes.


So, $s_i \frac{N}{b^{l_i}}$ is an estimate of the bandwidth of the server. If I add up the
bandwidth of all the servers for all the channels, the collective bandwidth has to be less than or
equal to the total number of clients: $\sum_{i=1}^{M} s_i \frac{N}{b^{l_i}} \le \sum_{i=1}^{M} q_i$.
This is a bandwidth constraint. The reason for having it is that, look, if the collective
bandwidth were equal to the number of clients, we would not need Corona: the servers could just
provide data to all the clients directly. One main advantage of Corona is that we want to reduce
the bandwidth requirement at the server. So, we keep this as a constraint: the normalized content
size multiplied with the size of the wedge, summed across the channels, should be less than or
equal to the total number of clients for all the channels.


This, if you see it from a common-sense point of view, also makes sense, because it justifies
Corona. If we did not have this constraint, then in the quest to minimize the update time, the
bandwidth requirement from all the servers could be more than what a non-Corona system would
require, which would not make any sense at all. So, given this common-sense constraint, we try to
minimize the update time.
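For a handful of channels, the Corona Lite problem can be solved by simply trying every combination of polling levels. This brute-force search is only meant to illustrate the objective and the constraint; it is not how Corona solves the problem (the lecture mentions the Honeycomb toolkit for that), and the function name and the example numbers are assumptions.

```python
from itertools import product
from typing import Optional, Sequence, Tuple


def corona_lite(q: Sequence[float], s: Sequence[float], n_nodes: int, base: int,
                max_level: int) -> Optional[Tuple[Tuple[int, ...], float]]:
    """Minimize sum(q_i * b^l_i / N) subject to sum(s_i * N / b^l_i) <= sum(q_i)."""
    best = None
    budget = sum(q)  # total number of clients bounds the server bandwidth
    for levels in product(range(max_level + 1), repeat=len(q)):
        load = sum(si * n_nodes / base ** li for si, li in zip(s, levels))
        if load > budget:
            continue  # violates the bandwidth constraint
        cost = sum(qi * base ** li / n_nodes for qi, li in zip(q, levels))
        if best is None or cost < best[1]:
            best = (levels, cost)
    return best


# Two channels: a popular one with 100 clients and an unpopular one with 10 clients.
print(corona_lite(q=[100, 10], s=[1.0, 1.0], n_nodes=256, base=4, max_level=4))
```

In this small instance the popular channel ends up with the lower polling level, that is, the wider wedge, while the less popular channel is pushed to a narrower wedge to stay within the bandwidth budget.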
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So, here, the clients of popular channels, needless to say, gain a lot, because for them the
average update time gets reduced, and that too significantly. This nicely partitions the bandwidth
across the channels. The update performance, of course, would vary depending upon the type of the
workload, because the type of the workload is not being considered here very explicitly. And less
popular channels kind of suffer in this case.
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So, then we define Corona Fast, which is expected to be the fastest; I will explain why in a
second. The terminology remains the same: $M$ is the number of channels, $q_i$ is the number of
clients for channel $i$, $l_i$ is the polling level of channel $i$, $N$ is the total number of
nodes, $s_i$ is the content size, and $T$ is a certain constant, a performance target. What am I
trying to do over here? I am trying to minimize $\sum_{i=1}^{M} s_i \frac{N}{b^{l_i}}$. This is
exactly the same quantity that appeared in the constraint of Corona Lite: $s_i$ is the content
size, normalized as I said, multiplied by the size of the wedge.


So, in this case, I am trying to minimize the bandwidth, the load placed on the content servers,
subject to a constraint. This is interesting: the constrained quantity,
$\sum_{i=1}^{M} q_i \frac{b^{l_i}}{N}$, is the same as the objective of Corona Lite. It is the
update time weighted by the number of clients. We want to keep it below a certain threshold:
$\sum_{i=1}^{M} q_i \frac{b^{l_i}}{N} \le T \sum_{i=1}^{M} q_i$. There are 2 aspects to this
threshold: one is the constant $T$, and the other is the total number of clients.


So, of course, the update time should be a function of the total number of clients, because as
the number of clients increases, a higher load is placed on the system. What we are saying here is
that, look, if the total number of clients is increasing, then the system is expected to be less
responsive, and the update time should increase. That is what is getting captured over here, and
$T$ is just a factor for normalization. But the key aim is that I want to achieve a target update
time, which should be less than a threshold.
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And the threshold is proportional to the total number of clients across all the channels, which
is as it should be: if there are fewer clients, I should expect a lower update time, and if there
are a lot of clients, I should expect a larger update time. So, Corona Fast is clearly the fastest
when it comes to update time. The question is why. An astute reader can always ask: look, in
Corona Lite you explicitly minimize the update time, whereas in Corona Fast you minimize the
bandwidth. So, from a common-sense point of view, Corona Lite should give the fastest update time
and Corona Fast should not. That sounds reasonable, but it is not the case.


In Corona Fast, we are forcing the update time to be less than a certain value. This is like a
knob that is totally under our control, and we can turn the knob to the point where the update
time is the lowest that is feasible. The bandwidth, in this case, is left free. We do try to
minimize it, but think of the update time as being constrained and the bandwidth as being free.


So, we can lower the threshold as much as we like to achieve any target within the feasible
range. We set this threshold through the constant $T$, which is the most important parameter
because it sets our target.


Once the target has been set, we minimize the bandwidth. So, clearly, this is totally in our
hands: we can make Corona as fast as is feasible. Whereas in Corona Lite, the constraint is on the
bandwidth, and so the room for reducing the update time is limited. That is why Corona Fast, as
the name suggests, is genuinely faster in terms of update time as compared to Corona Lite. I would
ask the viewers to go over this logic several times till they are convinced.
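The role of $T$ as a knob can be seen with the same kind of brute-force search, now with the objective and the constraint swapped. As before, this is a sketch for tiny instances and not Corona's actual solver; the function name, the example numbers, and the chosen values of the target are assumptions.

```python
from itertools import product
from typing import Optional, Sequence, Tuple


def corona_fast(q: Sequence[float], s: Sequence[float], n_nodes: int, base: int,
                max_level: int, target: float) -> Optional[Tuple[Tuple[int, ...], float]]:
    """Minimize sum(s_i * N / b^l_i) subject to sum(q_i * b^l_i / N) <= T * sum(q_i)."""
    best = None
    limit = target * sum(q)  # threshold grows with the total number of clients
    for levels in product(range(max_level + 1), repeat=len(q)):
        weighted_time = sum(qi * base ** li / n_nodes for qi, li in zip(q, levels))
        if weighted_time > limit:
            continue  # misses the update-time target
        load = sum(si * n_nodes / base ** li for si, li in zip(s, levels))
        if best is None or load < best[1]:
            best = (levels, load)
    return best


args = dict(q=[100, 10], s=[1.0, 1.0], n_nodes=256, base=4, max_level=4)
print(corona_fast(target=0.02, **args))   # relaxed target: narrow wedges, low load
print(corona_fast(target=0.005, **args))  # tight target: wider wedges, higher load
print(corona_fast(target=0.001, **args))  # infeasible target: no solution
```

Tightening the target forces lower polling levels and therefore a higher server load, and a target below what even the widest wedges can deliver has no solution at all.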
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So, this of course bounds the total amount of network traffic. Well, strictly it does not bound
it; if you look closely, it actually tries to minimize it, so I should maybe change this. It
allows us to tune the update performance per application, which is just what I said: given an
application, we can fix a target. For example, for a stock market application, we might choose a
very fast update performance.


So, along with providing each application the desired level of update performance, it tries to shield web servers from spikes in load by reducing the bandwidth. The negative aspect of both of these protocols, Corona Lite as well as Corona Fast, is that they do not consider the rate of change of the objects in the channel. In a certain sense, the update time only makes sense in the context of the rate of change of the object. If the object is not changing, there is no point in polling the channel. Corona Fair takes this into account.

So, it introduces a few additional variables. Recall that the structure with q_i, b, l_i and N is similar to what we had in Corona Lite. So, this is kind of Corona Lite plus plus, where the normalized update time, the term with q_i, b, l_i and N, is multiplied with tau, which is the polling interval. The polling interval can also be different for different channels; this variant of Corona does not consider that, but it is something that can be considered. The other variable is u_i, the update interval for channel i, because for different channels the update intervals will be different. For example, for stock prices, the update interval will be low; particularly when trading is happening, some stocks will be changing very quickly.


But once the stock market has closed for the day, the update interval will actually be very high, because the stock prices will only change when the market reopens. So, any kind of Corona polling should take this into account, and the rate of change of the content is what shows up over here.

So, let us say that the rate of change is very high; then u_i will be low. So, this fraction, which has u_i in the denominator, will be high, and if it is high, then we essentially want to minimize the update time of that channel. But let us say that we have a very, very slowly changing channel; then u_i will be high, and this fraction will have a low value.

Then, of course, we can afford to have a high update time for the channel. And this is our regular bandwidth constraint, which we had in Corona Lite; it is the same constraint that we had there. So, this remains the same; the only addition is the update interval.

So, we are using two terms here, update interval and update time. The update interval is a property of the channel and the channel only; this is u_i. The update time is the time that elapses between the actual update to the channel happening and the client actually seeing it.

So, keep in mind that the update time is a different concept: it is what is perceived by the client. One criticism of this formula is that for very, very slowly changing channels the update time can be very large, which is kind of unfair. So, we are basically giving too much importance to u_i, and we would like to reduce the importance of u_i.


So, this is what two other variants of Corona do. Corona Fair square root considers the square root of this quantity to temper down the effect of u_i. And if we want to temper down the effect of u_i further, instead of the square root we consider the log. That will further reduce the effect of a very large or very small u_i. So, we consider the ratio of the logs.

So, as we discussed, Corona Fair is similar to Corona Lite. Corona Fair, of course, has two further versions, Corona Fair square root and Corona Fair log. It also tries to minimize the update detection time, which we have been calling the update time in this lecture. Recall that the update time is different from the update interval; just go two slides back.

So, it minimizes the weighted update detection time with a limit on the total amount of traffic. It also introduces a term that reduces the number of servers allocated to a channel if its rate of updates is small; this is the term with what we have been calling the update interval. Furthermore, it is possible to dampen this term by considering the ratio of the square roots or the ratio of the logs, as we just saw over here.
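To make the effect of the damping concrete, here is a minimal sketch that compares the weight given to a fast-changing and a slow-changing channel under the three variants. It assumes that the undamped weight is tau divided by u_i and that the damped variants take the ratio of the square roots and the ratio of the logs; the function names and the two example update intervals are hypothetical and only serve as an illustration.

```python
import math

def fair_weight(tau, u, damping="none"):
    """Weight applied to a channel's update time (illustrative only).

    tau: polling interval, u: update interval of the channel, in the same
    unit. Both must exceed 1 for the log variant to be meaningful.
    """
    if damping == "none":
        return tau / u
    if damping == "sqrt":
        return math.sqrt(tau) / math.sqrt(u)
    if damping == "log":
        return math.log(tau) / math.log(u)
    raise ValueError(f"unknown damping: {damping}")

def spread(tau, u_fast, u_slow, damping):
    """How many times more weight the fast channel gets than the slow one."""
    return fair_weight(tau, u_fast, damping) / fair_weight(tau, u_slow, damping)

TAU = 1800      # a 30-minute polling interval, in seconds
U_FAST = 60     # hypothetical channel that changes every minute
U_SLOW = 86400  # hypothetical channel that changes once a day

for mode in ("none", "sqrt", "log"):
    print(mode, round(spread(TAU, U_FAST, U_SLOW, mode), 2))
```

Under these assumed numbers, the undamped weight favours the fast channel by a factor of about 1440, the square root brings this down to about 38, and the log brings it down to under 3, which is the tempering effect described above.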


(Refer Slide Time: 54:19)

Corona Design: Corona Schemes

Decentralized Optimization

• Core optimization problem:

  $$\min \sum_{i=1}^{M} f_i(l_i) \quad \text{subject to} \quad \sum_{i=1}^{M} g_i(l_i) \le T$$

  $f_i(l_i)$ and $g_i(l_i)$ are the performance or the bandwidth cost of the channel at polling level $l_i$.

• The values of $l_i$ are integers.
• This problem is NP-Hard (need to compute a fast approximation).
• Honeycomb finds a solution in O(M log M log N) time, which is optimal for M channels.

Smruti R. Sarangi, Distributed Publish Subscribe Systems


So, this can be posed as a core optimization problem. The optimization problem in this case will be something like this: we want to minimize some quantity subject to some other quantity being lower than a threshold. These can be expressed as generic functions f and g, where we want to minimize f subject to g being lower than a threshold. Here f_i and g_i are the performance or the bandwidth cost of channel i at polling level l_i. It does not matter exactly what these are; for different variants of Corona, f and g will be different functions, and for each of them we can set up an optimization problem of this form.

And if you see the Corona paper, it proves that the problem is NP-hard. But, of course, we can find good approximations to it. With M being the number of channels and N being the number of nodes, we can find a good approximation in O(M log M log N) time. Honeycomb can do that.

So, how does Honeycomb actually do it? Well, what Honeycomb does is that it combines channels with similar tradeoffs, where the tradeoffs are defined by the f and g functions, into a tradeoff cluster. So, channels with similar behavior are clustered, and the nodes periodically exchange these clusters. Furthermore, nodes locally run the Honeycomb algorithm to figure out the assignment of nodes to channels. Disagreement regarding the assignment of nodes to channels could be a problem, but this, of course, does not happen.
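The generic problem of minimizing f subject to a threshold on g can be sketched on a toy instance. The code below is an exhaustive search and not the Honeycomb algorithm, so it only works for a handful of channels. The cost model (update time proportional to q_i * b^l / N, load proportional to N / b^l) and all of the numbers are assumptions chosen for illustration. Swapping the roles of the two functions gives a Lite-style problem (minimize update time under a load cap) and a Fast-style problem (minimize load under an update time target).

```python
from itertools import product

def solve(num_channels, levels, objective, constraint, threshold):
    """Exhaustively pick one polling level per channel.

    Minimizes sum(objective(i, l_i)) subject to
    sum(constraint(i, l_i)) <= threshold.
    Returns (best_cost, best_levels), or (None, None) if nothing is feasible.
    """
    best_cost, best_levels = None, None
    for assignment in product(levels, repeat=num_channels):
        used = sum(constraint(i, l) for i, l in enumerate(assignment))
        if used > threshold:
            continue  # violates the threshold, skip
        cost = sum(objective(i, l) for i, l in enumerate(assignment))
        if best_cost is None or cost < best_cost:
            best_cost, best_levels = cost, assignment
    return best_cost, best_levels

N, B = 16, 2        # hypothetical: 16 nodes, base-2 identifier digits
LEVELS = range(5)   # polling levels 0..4
Q = [10, 1]         # hypothetical subscriber counts of two channels

def update_time(i, l):
    # Assumed model: fewer pollers (higher level) means slower detection.
    return Q[i] * B**l / N

def load(i, l):
    # Assumed model: N / B**l nodes poll a channel at level l.
    return N / B**l

# Lite-style: minimize update time, load capped at 10 polls per interval.
lite = solve(2, LEVELS, update_time, load, threshold=10)
# Fast-style: minimize load, weighted update time capped at T = 1.0.
fast = solve(2, LEVELS, load, update_time, threshold=1.0)
print("Lite:", lite)
print("Fast:", fast)
```

In this toy instance, the Lite-style run ends up with a weighted update time of 1.75 at a load of 10, while the Fast-style run is forced below an update time of 1.0 and pays for it with a load of 20. This mirrors the earlier argument: when the update time is the constrained quantity, it can be pushed as low as the target demands.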


But the key idea over here is that there is a distributed mechanism for solving the optimization problem. It is not that one central site solves it. Instead, what happens is that, if let us say this is the wedge, each of the nodes individually solves the problem, and each node can, of its own volition, increase or reduce the polling level. For example, if this node perceives that it needs to reduce its polling level, then it can reduce the polling level of its own volition and consider a bigger part of the Pastry ring.

• Aggregation Phase: Nodes receive tradeoff data from other peers.


Now, some other systems management issues. Let us just go over the entire process of running the Corona system. Each channel in Corona hashes its contents to get a unique Pastry key; this we have seen. It is assigned an owner node the same way that Pastry assigns it. For added fault tolerance, a key is assigned to F succeeding nodes.

If you recall, we had seen something very similar in Amazon Dynamo, where a key is not assigned to just one node; it is actually assigned to a series of F successive nodes, so that if one fails, the rest can take over.

So, what do the owners do? Owners are the ones who receive all the subscriptions, and then they send updates to all the subscribers. We also have cooperative polling, which means that all the nodes within a wedge poll cooperatively and share updates among each other.
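The owner assignment with F succeeding nodes can be sketched on a toy ring. This is a simplified model and not Pastry itself: it assumes that the owner is the node whose ID is numerically closest to the key on a circular 160-bit ID space, and that the backups are simply the next F nodes in increasing ID order. The function names are hypothetical.

```python
import hashlib

ID_BITS = 160  # SHA-1 produces 160-bit identifiers

def key_of(text):
    """Hash a channel description to a 160-bit integer key."""
    return int.from_bytes(hashlib.sha1(text.encode()).digest(), "big")

def owner_and_backups(key, node_ids, f):
    """Return (owner, backups) for a key on a toy ring.

    Assumes f is smaller than the number of nodes, so that the backups
    are distinct from the owner.
    """
    ring = sorted(node_ids)
    size = 1 << ID_BITS

    def circular_distance(a, b):
        d = abs(a - b)
        return min(d, size - d)

    owner_idx = min(range(len(ring)),
                    key=lambda i: circular_distance(ring[i], key))
    backups = [ring[(owner_idx + k) % len(ring)] for k in range(1, f + 1)]
    return ring[owner_idx], backups

# Small integer IDs keep the example readable.
print(owner_and_backups(65, [100, 10, 70, 40], f=2))
```

With the four toy nodes above, key 65 is owned by node 70 and backed up on nodes 100 and 10, so the channel survives the failure of its owner.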


So, there are three phases in cooperative polling. This is cooperative polling as well as independent polling, in the sense that nodes can independently increase or decrease the size of the wedge. The first is the optimization phase, where a node applies the Honeycomb-based optimization technique to the traffic data that it has collected from the servers.

Then there is the maintenance phase: any change to the polling level is communicated to the peers, and, of course, nodes can change the level on their own. And then the aggregation phase is when nodes receive tradeoff data from other peers, so that they can run the next round of optimization.

So, initially, only the owner node, at level K = ⌈log N⌉, polls for the channel. Any node in the maintenance phase, which you will recall is the second phase over here, might decide to reduce the polling level to K - 1. In this case, a small wedge will form that will perform cooperative polling. Whenever there is a change in the polling level, some nodes need to be instructed to start or stop polling, which a node will take care of.

Owners will typically monitor the status of all the nodes in the wedge and aggregate all the maintenance messages; this happens in the aggregation phase. Whenever there is a failure, Corona will remove the node from the ring, and on the addition of a node, Corona will add it to the ring; this is the same as Pastry. And if, let us say, the owner fails, then the subscription state of the owner is deleted, and a new owner will take over.

Update Dissemination


• Corona has a dedicated difference engine that computes the difference between different versions of a file by polling the server.
• It only sends the deltas (differences) to other nodes in the polling wedge.
• Each new version of a file has a unique version number.
• When a delta is generated by a node, it shares the delta with all the other nodes in the wedge.
• If a node cannot reliably get a timestamp from the server, then it sends the delta to the owner. The owner assigns a timestamp and multicasts it.
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User Interface 


So, first, I think it is important for me to describe the way that the Corona DAG will form. It will form something like this. If I consider the Pastry ring, initially we will have the channel here and just the owner node; it will be a very narrow wedge. This is essentially the situation when the channel is being added. After that, of course, the servers will be polled, and in the aggregation and optimization phases the node will get an idea of whether it should increase or decrease the polling level.


So, let us say it decides to change its polling level, and this is the size of the new wedge. This means that if the original owner was node O, then it will point to N1, N2 and N3, which are there in its wedge. Here is the interesting part, which might be hard for readers of the Pastry and Corona papers to appreciate at the beginning, but it is something that needs to be explained. What each node does in Pastry is that it maintains a routing table.

In this routing table, for every prefix, there is a set of nodes that match in the next digit. So, some of these nodes are kept in the routing table. But the important point to note is that these are not the only nodes that have l prefix digits matching. There may be many more such nodes, but Pastry does not keep a list of all of them.


So, this is the most important point; hence, I am repeating it once again. Let us say that these are all the digits in the ID of the channel, and let us consider all the nodes. Let me first consider the owner node, which has, let us say, log N matching digits.

Now let us assume that it has reduced the polling level to (log N - 1). When it reduces the polling level to any l, which in this case is (log N - 1), it will essentially go to a row in its routing table, where it basically says: look, show me all the nodes that have l matching prefix digits. Then, depending upon the next digit, there will be a set of nodes, say N1, N2 and N3. But recall that the list is not exhaustive. There might be many more nodes that also have l matching prefix digits but do not show up in the routing table.
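The difference between the full set of nodes with l matching prefix digits and the nodes that actually appear in one routing table row can be sketched as follows. This is a toy model with four-digit hexadecimal IDs, and it assumes that a routing table row keeps at most one entry per next digit; the node IDs and function names are made up for the example.

```python
def shared_prefix_len(a, b):
    """Number of leading digits that two ID strings have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def wedge(channel_id, node_ids, level):
    """All nodes sharing at least `level` prefix digits with the channel."""
    return [n for n in node_ids if shared_prefix_len(n, channel_id) >= level]

def routing_row(self_id, node_ids, row):
    """One routing table row: nodes sharing exactly `row` prefix digits
    with self_id, keeping only the first node seen for each next digit."""
    entries = {}
    for n in node_ids:
        if n != self_id and shared_prefix_len(n, self_id) == row:
            entries.setdefault(n[row], n)
    return entries

NODES = ["a3f0", "a3c2", "a3c9", "a39d", "a1b7", "7c20"]
CHANNEL = "a3f1"
OWNER = "a3f0"  # shares the most prefix digits with the channel

print(wedge(CHANNEL, NODES, 3))       # only the owner
print(wedge(CHANNEL, NODES, 2))       # a wider wedge
print(routing_row(OWNER, NODES, 2))   # what the owner's table can name
```

Here the wedge at level 2 contains a3c2, a3c9 and a39d besides the owner, but row 2 of the owner's routing table names only a3c2 and a39d, because a3c9 collides with a3c2 on the next digit. This is the sense in which the list is not exhaustive.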


This is because the routing table typically, at least in this case, does not store multiple entries. Now, assume that node N2 on its own decides to reduce the polling level to (l - 1). If it does that, it will again point to a set of new nodes, say N4 and N5. But given that N1 and N2 match in the first l digits, they will also match in the first (l - 1) digits. So, if N1 shows up in N2's routing table, N2 will also point to N1. And then again N5 might decide to change its polling level, and it might end up pointing to N3 and maybe a few more nodes.

So, what we will actually have is a directed acyclic graph, or a DAG, and the reason we will have a DAG is that the nodes independently choose to either increase or decrease their polling levels. This is what gives us a directed acyclic graph kind of structure. The moment N5, N2, N1 or anybody else shrinks its wedge by changing its polling level, edges and vertices will be cut off from the DAG. Of course, we will have an overall owner, and the overall owner will only know its children. Similarly, N2 will know its children, and N5 will know its children.


Any node will also know who its parent is in the DAG. So, let us say N7 finds an update; its job is to propagate the update upwards. There can be different variants of this, but the most common variant would be for N7 to propagate the update all the way up to the final owner. The final owner will then disseminate the update to all of the child nodes, so that all the nodes in the DAG get the update.
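The up-then-down flow of an update can be sketched on a small DAG. The node names follow the example in the lecture, but the exact edges, the choice of a single parent per node, and the breadth-first order of dissemination are assumptions made for this illustration.

```python
from collections import deque

def disseminate(children, parent, finder, owner):
    """Propagate an update from `finder` up to `owner`, then down the DAG.

    children: node -> list of child nodes (the DAG edges)
    parent:   node -> the one parent it reports to (an assumption here,
              since a node in a DAG may have several parents)
    Returns (path_up, delivered_order).
    """
    path_up = [finder]
    node = finder
    while node != owner:
        node = parent[node]
        path_up.append(node)

    delivered, seen = [], {owner}
    queue = deque([owner])
    while queue:
        node = queue.popleft()
        delivered.append(node)
        for child in children.get(node, []):
            if child not in seen:  # a node reachable twice is served once
                seen.add(child)
                queue.append(child)
    return path_up, delivered

CHILDREN = {"O": ["N1", "N2", "N3"], "N2": ["N4", "N5", "N1"], "N5": ["N7"]}
PARENT = {"N1": "O", "N2": "O", "N3": "O", "N4": "N2", "N5": "N2", "N7": "N5"}

up, down = disseminate(CHILDREN, PARENT, finder="N7", owner="O")
print(up)    # path taken by the update towards the owner
print(down)  # order in which nodes receive the update from the owner
```

Note that N1 is a child of both O and N2 in this toy DAG; the `seen` set ensures that it is handed the update only once.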


So, this is pretty much what is being said over here. And of course, if the owner dies, a new owner comes up; it will take control of the DAG, and maybe delete the subscription state of the old owner and request new subscriptions. So, a lot of things are possible over here.




Corona has a dedicated difference engine that computes the difference between different versions of a file by polling the server. It will only send the deltas, or the differences, to the other nodes in the polling wedge, and only if it sees a change. It does not send the entire file, only the deltas. Each new version of a file is given a new version number.

It will definitely share the delta with the other nodes in the wedge, and also send it back up to the owner. This is not mentioned on this slide; it is what I mentioned on the previous slide. Sometimes it is possible that a node is not in a position to determine if the current version of a site is new or old, because it cannot reliably get a timestamp from the server. Then what it does is that it sends the delta to the owner, which is supposed to be a repository of all the past versions of the site. If the owner finds that the version is different, then it assigns a timestamp to the new version of the site and multicasts it.
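A difference engine of this kind can be sketched with Python's standard `difflib` module. This is only an illustration of the idea of shipping deltas with version numbers; the class name, the line-based diff and the integer version counter are assumptions and not Corona's actual implementation.

```python
import difflib

class DifferenceEngine:
    """Remembers the last polled content and emits deltas for new versions."""

    def __init__(self):
        self.version = 0   # version number of the content seen last
        self.content = []  # that content, split into lines

    def poll(self, fetched_text):
        """Compare freshly polled text with the last version.

        Returns None if nothing changed, else (new_version, delta_lines).
        """
        new_lines = fetched_text.splitlines()
        if new_lines == self.content:
            return None  # no change, nothing to send to the wedge
        delta = list(difflib.unified_diff(
            self.content, new_lines,
            fromfile=f"v{self.version}", tofile=f"v{self.version + 1}",
            lineterm=""))
        self.version += 1
        self.content = new_lines
        return self.version, delta

engine = DifferenceEngine()
engine.poll("headline A\nheadline B")
print(engine.poll("headline A\nheadline B"))   # unchanged content
version, delta = engine.poll("headline A\nheadline C")
print(version)
print("\n".join(delta))
```

Only the delta lines and the version number would need to travel to the other nodes in the wedge; the full file stays with the node that polled it.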


So, essentially, the new update flows through the entire DAG. The owner of the wedge, the original owner, still has a large role to play because it kind of owns the DAG. It also decides if an update is new or not, and it propagates the update to all the nodes within the DAG.

• Users need to add Corona as a buddy in their IM system.
• They can then subscribe or unsubscribe to a URL by sending a message to Corona.
• A subscribe message is routed to all the nodes in the polling wedge for that particular channel.
• When an update is detected by the owner of the channel, it is sent to all the subscribers through the IM system.
• IM systems, such as Skype, typically allow peer-to-peer communication.

How does the user interface work? Users simply need to add Corona as a buddy, as a friend, in their instant messaging system. Subscribing or unsubscribing to a URL is as simple as sending a message to Corona. All the subscribe messages are routed to all the nodes in the polling wedge for that particular channel. Whenever an update is detected by the owner of the channel, it is sent to all the subscribers through the IM system. So, here polling by the client is not required. This is kind of like an interrupt-driven mechanism where you have persistent connections with the owner of the channel.

So, this frees the server from maintaining the state. Corona maintains the state like any good content provider, and IM systems will typically allow this kind of peer-to-peer communication with persistent network connections, similar to what Skype does.
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Evaluation 


Implementation

• Uses a standard Pastry implementation, 160-bit SHA-1 hash function
• Occasionally, it is possible that the size of a wedge might be zero
• We need to then adjust the sizes of the clusters
• Corona interacts with IM systems using the instant messaging protocol GAIM
• At the moment, the entire Corona system is trusted
• The evaluation is on a large scale deployment on PlanetLab (a large scale distributed cloud).
• Used a micronews feed collected from real life workloads.

So, how was this entire system implemented? Well, with a standard Pastry implementation and the 160-bit SHA-1 hash function. We have been alluding to the problem that the size of a wedge at the beginning might be 0. Then, of course, we need to adjust the sizes of the clusters or the wedges by just reducing the polling level such that the wedge is not empty. Traditional instant messaging systems like GAIM are used to talk to the owner and establish a connection with the owner.


So, security and trust. Well, a lot of emphasis was not given to it. We assume that all the nodes in the Corona system are trusted. The evaluation, of course, happened on a large cloud of machines, the PlanetLab cloud. Micronews feeds were collected from real life workloads, and the same was simulated on a Corona-based environment.

So, the system had 1024 nodes and 100,000 channels. This is a very large system actually. With 100,000 channels, 1024 nodes and 5 million subscriptions, this is much larger than many of the commercial systems that we see. But that was also the aim: to see how much it scales. The polling interval was set to 30 minutes and the maintenance interval to 1 hour, so every 1 hour there was a reconfiguration. All the variants were compared: Corona Lite, Corona Fast and Corona Fair.

So, for the legacy RSS-based system, which is what we are comparing against, the average update detection time was 900 seconds; everything here is measured in seconds. The average load, let us look at that in terms of arbitrary units, was 50. Now take Corona Lite, which tries to minimize the update detection time subject to a constraint on the load.

It immediately brought the update detection time down to 53 seconds, and the load remained similar at 48.97; the aim was never to increase the load beyond what a legacy RSS system would produce. Now compare Corona Fast, which we have argued will be faster.

It brought the average update detection time down to 32 seconds, and the average load went to 58.75. So, Corona Fast is the fastest, and we can see that it is very fast, but it increases the load at the servers, not by much though. Corona Fair is something that also considers the rate of change of the content itself.

With Corona Fair, even with a moderate load, the average update detection time was very high, the reason being that it gives too much emphasis to the rate of change of the content. But once that was tempered down, it was made fair in the sense that even slow-moving content will at least get some servers. With this fair regime, with both the square root and the log schemes, the update detection time came roughly to the same ballpark as Corona Lite, albeit with much more fairness, and the load also remained very similar to the baseline RSS load.


So, what we see is that if we are looking at just straightforward update detection time, we go for Corona Fast. If we are looking for fairness as well as a reduction in the server load, the best options are Corona Fair square root and Corona Fair log.

Corona: A High Performance Publish-Subscribe System for the World Wide Web, Venugopalan Ramasubramanian, Ryan Peterson, and Emin Gün Sirer, NSDI 2006

So, the Corona paper is around 14 years old. It is a classic paper, and almost all future pub-sub systems have been built on the lines of Corona; it has kind of provided a template for future pub-sub systems. It was published in NSDI 2006. Readers are most welcome to read the entire paper and comment on this video.

In this lecture, we will discuss the Facebook Cassandra system. The Cassandra system is a key component of Facebook that is used in inbox search. So, what is inbox search? We will discuss this over the next few slides.

Of course, Facebook does not need any introduction, because almost all of us are very active Facebook users. So, all of us know what an inbox means. Facebook has a messaging system between users. So, it is possible for user 1 to send a message to user 2 that appears on user 2's dashboard.


So, active users can get hundreds of messages per day, and it is the job of Facebook to store all of this data in a database of sorts. Furthermore, it should be possible to query these messages based on keywords, and also based on who the message is being sent to or who sent it, which is to say, the ID of the sender or the ID of the recipient.

So, actually maintaining an inbox search for a billion-plus users is a rather difficult challenge. That is why the Cassandra system was developed. We can think of it as a logical extension of the Chord and Dynamo projects. That is why it is important to have at least an idea of the Chord DHT before proceeding through this lecture. Also, if the viewers have an idea of the Amazon Dynamo project, it would help. Amazon Dynamo is described in one of the previous lectures of this lecture series.
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Motivation 


• Social Networking → Facebook
• Facebook is HUGE
  • Billion+ users
  • Millions of updates per minute
  • Many thousands of servers
  • Reliability is a big issue
• Cassandra: Made for the Inbox Search problem
  • Allows users to search through their Facebook Inbox
  • It is also used for many other Facebook services


So, nowadays, of course, social networking implies Facebook; Facebook is the de facto social networking site. There used to be others like Google Plus and Orkut, but they are not there anymore. Facebook, of course, is huge. It has a billion-plus users and millions of updates per minute. Millions of messages are sent between users every minute; users post photographs, users write posts.

So, that is the reason why internally, similar to other large systems that we have seen, Facebook will have thousands of servers that are physically partitioned across multiple data centers. And needless to say, in such a large distributed system, reliability is a big issue, a massive issue.

So, Cassandra was made for the inbox search problem. It allows users to search through their Facebook inbox. It has also been used internally to operate many other Facebook services. As we have been seeing with Google Percolator and Amazon Dynamo, many of these systems are designed to be generic, so that at least the company that is designing them can use the software to implement other services as well.

So, a bit of related work. We have already looked at distributed file systems; we have looked at Coda and AFS. A distributed file system can definitely be used to implement an inbox, where every message can be written to a file, or a set of messages can be written to a file. Even though this does appear appealing, the main problem over here is that an inbox is not really a file system, and this causes an issue.


Furthermore, file systems have a hierarchical namespace, and they have specialized conflict resolution. We do not need all these things in an inbox search. That is the reason why using a traditional file system to implement an inbox is not really a good idea.

On similar lines, using a distributed database like Bayou is also not a good idea, because a database has far more structure. It is okay to store airline tickets in one, for example, but it is not really a good idea to implement an inbox using a database. We have seen two other projects. First, we have seen Amazon Dynamo.


So, this is the system that Amazon uses to store its shopping carts. In this case, Amazon uses vector clocks to manage the versions. Whenever there is a get, we actually retrieve the entire context, which is to say that what we get from a get message is all the versions, and then either the system or the user needs to perform a merge. The conflict resolution is based on the vector timestamps and specialized business logic. And of course, as we had discussed for Amazon, the following will be the same in Facebook also.

Membership changes to the ring are infrequent. This is the same in Facebook: we do not regularly add or remove servers. These are relatively infrequent operations, so membership changes are rare. That is the reason why even in Dynamo we had a console-based system, and in Cassandra also we have a slow console-based system.


But the main problem with Dynamo is that it is great for shopping carts, it is great for very structured objects, but an inbox is a very simple thing. For this, what we actually need is a very high write throughput. Of course, latency does not matter that much for writes.

But we need a very, very high write throughput; that is the prime goal. Dynamo is not really designed for a very high write throughput; it is designed for very good write latency, and also for aggregating multiple versions of an object, so that it never loses a write. We have also looked at a little bit of Google Bigtable in the Percolator project. Bigtable is pretty much a multi-dimensional table where every row is indexed by a key. So, think of it as a key-value store in that sense.


So, one problem is that Bigtable relies on GFS, the Google File System, for the underlying implementation. What the Google File System does is that it divides the large table into chunks and stores these chunks. We do not want such a strong relationship with the underlying file system, because that also adds its own overheads. That is the reason there is a need to create a bespoke system over here, a custom system.

And of course, Facebook is large enough that it can afford to create a separate system, designed from scratch, to store Facebook inbox messages. So, we will define a couple of terms over here that will be used pretty regularly over the next 18 slides or so. First, the table: akin to Google Bigtable, it is a distributed multi-dimensional map, which means that it is a very, very large table. Let us consider a 2D table to start with. Every row can actually have a large number of columns. This table is partitioned, which can be row-wise, column-wise or block-wise, and the partitions are physically stored at different locations.


So, the physical storage is at different locations, and this table is essentially stored in a distributed fashion. It is like a key-value store: every row has a unique key, which in this case is a 16 to 36-byte long string that identifies the row. The value, in this case, is a highly structured object; we will see how and why. Furthermore, similar to Bigtable, we assume that every row operation is atomic on a given replica.

This is important: a row operation is not atomic across all replicas, but for a given replica it is atomic. So, a given replica will never have a mixed row state. As for the columns, of which we can have many, the columns are grouped into families. We can also define a super column that essentially contains a lot of child columns.

The columns themselves can either be sorted by time, the time of modification or the time of creation, or by name. So, it is important that the columns are not just one flat list; the columns have a hierarchy within them. We can group columns into families, and then we can create a family within a family, and that would pretty much be a super column.

• get (table, key, columnName)
• delete (table, key, columnName)
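The data model described above, a row whose value is a set of column families, some of which contain super columns, can be sketched with plain Python containers. The family names, column names and values below are all made up for illustration, and the layout is an assumption rather than Cassandra's actual storage format.

```python
from dataclasses import dataclass

@dataclass
class Column:
    name: str
    value: str
    timestamp: int  # time of modification, used for time-sorted families

def sorted_columns(columns, by="name"):
    """Return the columns of one family sorted by name or by time."""
    if by == "time":
        return sorted(columns, key=lambda c: c.timestamp)
    return sorted(columns, key=lambda c: c.name)

# One row of the table; its key would be a 16 to 36-byte string.
row = {
    # A simple column family: a flat list of columns.
    "profile": [Column("city", "Delhi", 5), Column("alias", "sam", 9)],
    # A family of super columns: each super column groups child columns.
    "terms": {
        "lunch": [Column("msg-17", "", 12), Column("msg-03", "", 4)],
    },
}

print([c.name for c in sorted_columns(row["profile"], by="name")])
print([c.name for c in sorted_columns(row["terms"]["lunch"], by="time")])
```

The nested dictionary under `"terms"` is the "family within a family" idea: the super column `"lunch"` holds its own list of child columns, which can again be ordered by name or by time.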


So, the API is rather simple; it is pretty much as simple as it gets. It is broadly similar to what
we have seen in Percolator. Inserting into the table is rather easy: we give the table id, the key
that uniquely identifies the row, and the row mutation. As we have been seeing in distributed
systems up till now, anytime we write a row, we actually create a new version of it.

So, the row mutation basically means that we are creating a new version of the row. Get is the same
thing, except that we fetch a certain column by name. And we can also delete a certain column, or
rather the contents of a certain column. So, insert, get, and delete form a very standard interface
that almost all of these tables offer. That includes Percolator, Dynamo, Bigtable, everything we
have seen up till now. The interface in that sense is rather standardized.
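To make the interface concrete, here is a minimal Python sketch of the three calls on a single replica. The class name Store, the integer counter used as a version number, and the use of None as a delete marker are all assumptions made for illustration; they are not taken from Cassandra itself.

```python
from collections import defaultdict


class Store:
    """Toy single-replica store: (table, key) -> column -> list of versions."""

    def __init__(self):
        # Each column keeps a list of (version, value); the newest is last.
        self.rows = defaultdict(lambda: defaultdict(list))
        self._version = 0

    def _next_version(self):
        self._version += 1
        return self._version

    def insert(self, table, key, row_mutation):
        """Apply a row mutation (dict of column -> value) as one new version."""
        version = self._next_version()
        for column, value in row_mutation.items():
            self.rows[(table, key)][column].append((version, value))
        return version

    def get(self, table, key, column_name):
        """Return the newest value of a column, or None if absent or deleted."""
        versions = self.rows.get((table, key), {}).get(column_name, [])
        return versions[-1][1] if versions else None

    def delete(self, table, key, column_name):
        """A delete is itself a new version that holds a delete marker (None)."""
        self.rows[(table, key)][column_name].append((self._next_version(), None))
```

Note that nothing is ever overwritten in this sketch: every insert and delete appends a new version, which mirrors the point that a row mutation creates a new version of the row.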
And one reason for having a standardized interface is that it is useful both for inbox search and
for other applications that Facebook may build. Some versions of Cassandra are in fact available
publicly. In general, if companies decide to open source such systems, developers can use them to
build a wide variety of applications, which can benefit the host company as well. That is also one
reason why open sourcing internal projects is typically a good idea.


So, in this case, Cassandra has a cluster of nodes; it is a distributed system. A read or write
request for a key gets routed to a node in the cluster. This cluster is, of course, organized as a
DHT, and the DHT that they use is Chord-like. That is why I said that understanding Chord is a
prerequisite for this lecture.

So, similar to Chord, given a key, we map it to the hash space, and then we traverse the hash space
clockwise till we find the nearest node. Then, of course, we maintain replicas of the key's value in
the next N nodes, subject to the constraint that they are in different data centers.
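The clockwise lookup can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names ring_hash and Ring are hypothetical, SHA-1 truncated to 32 bits stands in for whatever hash function a real deployment uses, and the data-center constraint is left out here.

```python
import bisect
import hashlib


def ring_hash(name, bits=32):
    """Map a string to a point in a hash space of size 2**bits."""
    digest = hashlib.sha1(name.encode()).digest()
    return int.from_bytes(digest, "big") % (2 ** bits)


class Ring:
    def __init__(self, node_names):
        # Sorted list of (position on ring, node name).
        self.points = sorted((ring_hash(n), n) for n in node_names)

    def successors(self, key, count):
        """Walk clockwise from hash(key) and return the first `count` nodes."""
        start = bisect.bisect_left(self.points, (ring_hash(key), ""))
        total = len(self.points)
        return [self.points[(start + i) % total][1] for i in range(min(count, total))]

    def owner(self, key):
        """The owner is the first node met when moving clockwise from the key."""
        return self.successors(key, 1)[0]
```

For example, Ring(["n1", "n2", "n3", "n4"]).successors("user42", 3) returns the owner of the key followed by the next two nodes on the ring, wrapping around at the end of the hash space.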


So, then, similar to Chord, we determine the replicas that contain the key. Among the replicas, we
define the write quorum and the read quorum. If you recall, let us say that there are 10 replicas,
the size of the write quorum is W, and the size of the read quorum is R. Then we had always said
that R + W > 10, which ensures that the read and write quorums overlap in at least one replica. In
this case, it means that a read will not miss an update.
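The overlap claim can be checked by brute force for small cases. The function below is a hypothetical helper written for this purpose: it enumerates every possible read quorum and write quorum and tests whether each pair shares at least one replica.

```python
from itertools import combinations


def quorums_always_overlap(n, r, w):
    """True if every read quorum of size r intersects every write quorum of size w."""
    replicas = range(n)
    return all(
        set(read_q) & set(write_q)
        for read_q in combinations(replicas, r)
        for write_q in combinations(replicas, w)
    )
```

With 10 replicas, R = 5 and W = 6 satisfy R + W > 10 and the check passes, whereas R = 5 and W = 5 fail because the two quorums can be the two disjoint halves of the replica set.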


So, for a read, either we read from the entire read quorum and somehow merge the results, as we
were doing in Dynamo (the Cassandra paper, of course, does not discuss the merging aspect), or we
read from the closest replica, that is, the geographically closest one or the first one that we find
to be available.
Now for the details. As we discussed, we partition data using a consistent hashing algorithm like
Chord. Chord, if you recall, was using virtual nodes for load balancing; we do the same over here.
This ensures that the load per physical node is roughly balanced. There were some formulae in the
Chord paper that analyzed virtual nodes.


So, I would refer you to that paper. In this case, each data item is replicated on N hosts along the
ring. By default these are the N successors, but we may skip a few, subject to the replication
policy. The first policy is rack unaware. In the rack unaware policy, we just take the N successors.


But in the rack aware policy, we skip those successors that are in the same rack. So, the replicas
will at least be across different racks of the same data center. What is a rack? Well, a rack is
like a cabinet of servers. It is possible that, let us say, the network connection to a rack gets
cut. In this case, the entire rack will go offline.


So, a rack aware replication policy will ensure that the replicas are at least in different racks.
What Facebook actually does is more of a data center aware policy. In this case, you want, at least
in the ideal case, all N replicas to be in N different data centers. If that is not possible, you
would at least like to spread the data over as many different data centers as possible.
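The three policies differ only in which successors they skip. The following Python sketch illustrates this; the function name place_replicas, the policy strings, and the fallback of filling up in ring order when there are too few distinct locations are all assumptions made for the example.

```python
def place_replicas(successors, n, policy="rack_unaware"):
    """successors: list of (node, rack, datacenter) in clockwise ring order."""
    if policy == "rack_unaware":
        return [node for node, _, _ in successors[:n]]

    chosen, used = [], set()
    for node, rack, dc in successors:
        # Rack aware compares (datacenter, rack); datacenter aware compares datacenter only.
        location = (dc, rack) if policy == "rack_aware" else dc
        if location not in used:
            chosen.append(node)
            used.add(location)
        if len(chosen) == n:
            break

    # Assumed fallback: with too few distinct locations, fill up in ring order.
    for node, _, _ in successors:
        if len(chosen) >= n:
            break
        if node not in chosen:
            chosen.append(node)
    return chosen
```

Given successors a, b, c in one data center (a and b in the same rack) followed by d and e in two other data centers, the rack unaware policy with N = 3 picks a, b, c, the rack aware policy picks a, c, d, and the data center aware policy picks a, d, e.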


So, just in case an entire data center gets cut off, Facebook will still continue to work;
Cassandra will still continue to work. That is the broad idea over here. Now, consider the set of
replicas for a key; similar to Dynamo, they also call this the preference list. Cassandra first
elects a leader using the Apache ZooKeeper service, which is a distributed coordination engine.


The job of the distributed coordination engine is to provide certain key basic services. These
include electing a leader, and we will see that it provides an implementation of a few more
distributed algorithms as well. Note that you do not elect a leader per preference list; you elect
a leader for all the nodes. The leader then assigns the replicas to the nodes. Along with that, the
metadata, routing tables, and topology information are maintained at each node, of course, for the
DHT to work, and also centrally by the ZooKeeper system so that repair and maintenance messages can
be initiated. As for Apache ZooKeeper, I would encourage viewers to take a look at this open source
project, which has a wide array of distributed coordination algorithms.
So, for membership: membership information is typically disseminated using a gossip protocol.
Gossip protocols were discussed in the second or third lecture of this course. In this case,
Cassandra uses the Scuttlebutt anti-entropy-based gossip system to propagate membership
information, and this is propagated relatively quickly.


Failure detection is a problem. The reason is that we want to detect failures quickly, but to
detect failures quickly, we need to send a lot of messages. So, what Cassandra does is that it uses
a probabilistic failure detector called the accrual failure detector, which gives a confidence
bound on the failure.


So, it gives a value Φ for each node. This is not a Boolean; it is a number that represents the
level of suspicion that the node has actually failed. If Φ = 1, the chance of an error in whatever
is being predicted is 10%. In this case, if we declare that a node has failed at Φ = 1, the chance
of our being wrong is 10%.

If Φ = 2, it is 1%. If Φ = 3, it is 0.1%, and so on. So, the error probability varies exponentially
with Φ; in other words, Φ is on a log scale. Every node maintains a sliding window of the
inter-arrival times of gossip messages. Based on this sliding window, Φ is calculated, and viewers
are welcome to look up the details of the accrual failure detector on the web.
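As a rough illustration of how Φ can be computed from the sliding window, here is a Python sketch. It assumes that inter-arrival times follow an exponential distribution whose mean is estimated from the window; that modeling choice, the class name AccrualDetector, and the window size are assumptions for this example rather than a description of Cassandra's actual code.

```python
import math
from collections import deque


class AccrualDetector:
    def __init__(self, window=1000):
        self.intervals = deque(maxlen=window)  # sliding window of inter-arrival times
        self.last_arrival = None

    def heartbeat(self, now):
        """Record the arrival time of a gossip message from the monitored node."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level: -log10 of P(the next message arrives later than now)."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_arrival
        # Under the exponential assumption, P(later) = exp(-elapsed / mean),
        # so -log10(P(later)) = elapsed / (mean * ln 10).
        return elapsed / (mean * math.log(10))
```

With messages arriving once per second on average, a silence of about 2.3 seconds gives Φ = 1 (a 10% chance of wrongly suspecting the node), and about 6.9 seconds gives Φ = 3 (a 0.1% chance).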


(Refer Slide Time: 20:37) 
Bootstrapping

• When a node starts, it chooses a random token. The token defines its position in the ring.
• It saves the token locally and also gives it to Zookeeper.
• It then gets a list of contact points from Zookeeper, and contacts them such that it can join the ring.
• The membership information for the new ring is then gossiped in the cluster.
• There might be multiple Cassandra instances running (for different types of services).
• Tag each message with the instance id.
• Administrators can manually manage Cassandra instances.
So, bootstrapping: how does the system start? Well, when a node starts, it chooses a random token.
We use the same terminology as Dynamo. The token is, for example, the hash value of the node's IP
address; it defines the node's position in the ring. The token is stored locally, and it is also
given to ZooKeeper, which is acting as a global coordinator. ZooKeeper in return gives the node a
set of contact points, that is, a known set of servers that are already on the ring.

One or more of these contact servers can then be contacted, and they will help the node join the
ring. Now, given that a new node has joined the DHT ring, the updated membership information is
gossiped. Of course, there might be multiple Cassandra instances running for different types of
services; Cassandra is a generic system. So, you tag each message with the instance id of the
Cassandra instance that it belongs to. Furthermore, administrators are given a console to manually
manage the instances and to manually manage the process of node addition and node deletion.


(Refer Slide Time: 22:02) 
Does the system scale well? Yes, of course. Using a DHT was the best thing to do in this case,
because a DHT automatically implies huge scalability. So, what is the main problem with a DHT?
Well, the main problem is that the load may not be balanced uniformly. So, a given physical node
might have a high load.


So, this is partly taken care of by the virtual nodes, and partly by changing the replica placement
strategy. Also, unlike Dynamo, a token does not assign the physical node to several different
points in the ring. But the initial token assignment, of course, can be made much more intelligent.
In a sense, a node can be assigned a position on the ring such that it is beside a heavily loaded
node, so that it can take away some of that node's load.


So, different combinations are possible. It is the job of Apache ZooKeeper, along with a customized
Cassandra engine, to ensure that this does indeed happen. Once a node has been placed on the ring,
it needs to get data. Recall that each of the other nodes is the owner of some keys, and that the
owner is the clockwise successor of the key.


So, the new node will basically get replicas from a few of the other nodes that are on the ring.
The replica transfer will be a fast, kernel-level kind of transfer, where data is streamed into the
new node. This process, of course, can be parallelized, similar to BitTorrent, in the sense that we
can take the replica data and break it into small chunks. If, let us say, the node has multiple
network ports, then via each network port we can stream the data in parallel, similar to what
BitTorrent actually does.
So, this is again at a lower level, in the sense that it is closer to the network layer. It is not
strictly at the network layer, but it is pretty much at a level that is lower than the DHT. So, in
this case, consider a node that we just added, and another node that wants to send a replica to
this node.

Clearly, there are many other nodes that also have the same replica. So, all of them can divide the
replica into small chunks and send the chunks to this node in parallel. This will speed up the
process significantly if the node has multiple network interfaces and can absorb the parallel
transfers.
So, let us outline a little bit of persistence. Persistence basically means storing the results to
disk, such that if there is a crash that wipes out memory, the data does not go away. So, this is
durable storage, and we rely on the local file system for data persistence. This is a journaling
file system. I will not describe what a journaling file system is; the expectation is that viewers
will go to the web and read about it.


So, what a typical fault-tolerant file system that uses journaling does is that, instead of
directly writing to the file on the disk, we first write to a commit log. This is then used to
update an in-memory data structure. Furthermore, the commit log is regularly persisted to the disk,
and we can have a dedicated disk or a dedicated non-volatile memory for the commit log.


Once this in-memory data structure exceeds a certain size, we simply dump its contents to the disk,
which means that the changes recorded in the commit log are now applied to the underlying files.
So, the commit log is kind of like a cache for the file system, and it is ordered sequentially.


So, writes are made sequentially to the disk. Simultaneously, we create or update an index for this
data. Since we do not have a regular file system that is organized into directories, we need to
create a special index, similar to a database index, so that we can find data on the disk. Again, I
am not explaining what a database index is over here, because I am assuming that viewers either
already know or will look it up.


So, the important point over here is that even if we have lots of data on the disk, it is important
to index it, and to create a database-like index for doing that. Once we have a lot of files on the
disk, note that Cassandra has files of variable size. The recent files have small sizes, and the
older files gradually become much larger.


This is because the older files are merged: we do not need them as separate small files, so we
create large files and keep those. This is mainly because inbox messages show a high degree of
temporal locality. Because of this, most of the time we will find what we need in the in-memory
structure. If not, we will find it in some small file on the disk, and seldom do we have to
actually access the large files.
So, what does the read operation look like? We first query the in-memory data structure. Then we
search the files from newest to oldest. And of course, we can speed up this search process by using
a Bloom filter. Again, I am not explaining it in detail, but a Bloom filter, let me say, is a
specific kind of data structure that essentially indicates set membership.


So, the Bloom filter represents a set of values. When it is presented with a new value, it
essentially gives a yes or no answer. Yes means that the value is in the set, and no means that it
is not. But if a Bloom filter says yes, it does not mean that the value is actually there; it might
be there or it might not be there.


So, essentially, it allows for false positives. However, if it says that the data item is not
there, it is definitely not there, which means that a Bloom filter does not have any false
negatives. So, false positives are possible but false negatives are not. As for how exactly a Bloom
filter works, again, this is an exercise for the viewers if they do not already know. Just go to
Wikipedia and read about Bloom filters.
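For viewers who want a quick picture before reading further, here is a minimal Bloom filter in Python. The sizes (1024 bits, 3 hash functions) and the way the hash functions are derived from SHA-256 are arbitrary choices for this sketch, not values used by Cassandra.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # False means "definitely not in the set"; True means "possibly in the set".
        return all(self.bits[pos] for pos in self._positions(item))
```

An item that was added always has all of its bits set, so there are no false negatives. A different item may happen to map to bits that other items set, which is where false positives come from.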


So, the advantage is that the Bloom filter can very quickly indicate that a set of files do not 
contain a key, which is the key for the row. Alright, so why are we doing all of this? Well, so 


the reason why we are doing all of this is given a key, we want to fetch the contents of the row. 


So, this is required for both a get operation and an update operation; it is like get and put in
Dynamo, either for getting data or for updating data. That is the reason we first access the
in-memory data store, and if the row is not there, then we access the disk. The disk, of course,
has a range of files of different sizes.


The recent entries are stored in smaller files, and older entries are stored in very large files,
in the hope that we will have temporal locality. We will typically not access older entries, and
just in case we do, users will be willing to tolerate the additional latency. And to speed up the
search, we have to create some kind of an index.


So, if the row is not in memory, we go to the disk, where a Bloom filter will tell us quickly
whether it even makes sense to search a given file or not. Furthermore, a row itself can have a lot
of columns. So, what we do is use a column index in the row, with entries at 256-kilobyte chunk
boundaries, which allows us to find the byte offset of a given column without scanning the row.
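Putting the pieces together, the read path can be sketched as follows. This is a simplified model: the function name read_row is hypothetical, each file is a plain dictionary, and an ordinary Python set stands in for the per-file Bloom filter (with an extra element added in the usage example to mimic a false positive).

```python
def read_row(key, memtable, data_files):
    """memtable: dict of key -> row.
    data_files: list of (key_filter, rows) pairs, ordered from oldest to newest.
    """
    if key in memtable:                             # 1. in-memory structure first
        return memtable[key]
    for key_filter, rows in reversed(data_files):   # 2. then files, newest first
        if key not in key_filter:                   # filter says "definitely absent"
            continue                                # skip the file without reading it
        if key in rows:                             # filter may have been a false positive
            return rows[key]
    return None
```

For instance, with an old file holding keys a and b and a newer file holding a newer version of a, read_row("a", ...) returns the newer version because the newer file is searched first.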




(Refer Slide Time: 31:34) 


Implementation Details

Main Modules: Partitioning module, Cluster membership module, Failure detection module, Storage
engine module

• Written in Java using non-blocking I/O primitives
• Control messages use UDP
• All application related messages use TCP


For the implementation details, we need to consider several modules: the partitioning module for
partitioning the keys and the replicas, the cluster membership module that takes care of
membership, the failure detection module, and the storage engine. The entire system has been
written in Java using non-blocking I/O primitives. The control messages that manage Cassandra use
UDP, and all the application-related messages use TCP. Of course, TCP is required there, because
this is where we need reliable communication; we do not want to lose data.
• Identify the nodes that own the data for a key
• Route the requests to the nodes and wait for the responses
• If the replies do not arrive within a certain time, fail the request
• Figure out the latest version based on the timestamp
• Trigger a repair (if required) and return to the client


So, we identify the nodes that own the data for a key, we route the request to those nodes, and we
wait for the responses. If the replies do not arrive within a certain time, the request fails, and
then of course, we tend to suspect a failure. The accrual failure detection mechanism that we use
will pretty much take such inputs, but of course, it uses far more inputs as well, like the
inter-arrival times of gossip messages and so on.


However, if we do get the responses, we find the latest version based on the timestamp. Just in
case a repair or a merge is needed, we trigger it, and then we return the value to the client. But
of course, the Cassandra paper as such does not discuss this issue in great detail. Hence, I will
also not try to speculate on how exactly the versioning is done in Cassandra, because this aspect
is not really well covered in the paper.
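The request steps can be sketched in Python as below. Since the paper does not spell out the versioning, this example simply assumes a plain "latest timestamp wins" rule. The function name coordinate_read, the simulated latencies, and the needed parameter are all hypothetical.

```python
def coordinate_read(responses, timeout, needed):
    """responses: dict of replica -> (latency_seconds, timestamp, value).

    Returns (latest_value, stale_replicas). Raises TimeoutError if fewer than
    `needed` replicas answer within `timeout`.
    """
    # Keep only the replies that arrived in time.
    arrived = {r: (ts, v) for r, (lat, ts, v) in responses.items() if lat <= timeout}
    if len(arrived) < needed:
        raise TimeoutError("not enough replies in time; the request fails")
    # Pick the latest version based on the timestamp.
    latest_ts, latest_value = max(arrived.values(), key=lambda tv: tv[0])
    # Replicas that returned an older version are candidates for repair.
    stale = sorted(r for r, (ts, _) in arrived.items() if ts < latest_ts)
    return latest_value, stale
```

A caller would return latest_value to the client and send the newer version to each replica listed in stale, which corresponds to the "trigger a repair" step.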
• We can delete commit logs for entries that have been persisted


So, it uses a rolling commit log. What does that mean? It is essentially a large log where data can
be added and removed at any point in time. Once a given commit log has reached a certain size, we
create a new one, and the old one is persisted to disk. The size threshold that is used is 128
megabytes. Recall that this is an old paper, from roughly 2010, and ideally we would like to keep
the commit log in memory. In those days, 128 megabytes was quite a bit of memory for an
off-the-shelf machine.

Actually, 128 MB by itself was not all that much memory, but recall that we have many virtual
nodes, and we are storing a log for each virtual node, and there is additional overhead as well.
So, keeping all of that in mind, 128 MB was the maximum that the designers thought they could
afford. So, I stand corrected in my previous statement.
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So, in Cassandra, we use an in-memory data structure that we have seen and a backing data file 
for each column family. Every time the in-memory data structure is dumped to the disk, we set 
a bit in its commit log. And this basically says that the data has been persisted. And each 
commit log will maintain a bit vector corresponding to the disk dumps that have happened, 
which tells you what has been persisted, and what has not been persisted. And for entries that 
have been persisted, we can kind of delete the logs. And we do not need them. So, that will 


free up space for us. 
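The interplay between the commit log, the in-memory structure, and the disk dumps can be sketched as below. This is a toy model under stated assumptions: the class LoggedStore is hypothetical, a per-entry flag stands in for the bit vector, and a tiny entry count replaces the real size thresholds.

```python
class LoggedStore:
    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit
        self.memtable = {}      # in-memory data structure
        self.commit_log = []    # entries of the form [key, value, persisted_flag]
        self.data_files = []    # each dump becomes one immutable "file" (a dict)

    def write(self, key, value):
        self.commit_log.append([key, value, False])  # log first, for durability
        self.memtable[key] = value                   # then update memory
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        """Dump the in-memory structure to disk and mark log entries as persisted."""
        self.data_files.append(dict(self.memtable))
        self.memtable.clear()
        for entry in self.commit_log:
            entry[2] = True

    def purge(self):
        """Delete log entries whose data has already been persisted."""
        self.commit_log = [e for e in self.commit_log if not e[2]]

    def crash_and_recover(self):
        """Main memory is lost; replay the unpersisted log entries to rebuild it."""
        self.memtable = {}
        for key, value, persisted in self.commit_log:
            if not persisted:
                self.memtable[key] = value
```

The recovery method shows why the log is needed at all: anything not yet dumped to a data file can be rebuilt by replaying the log, and anything already dumped no longer needs its log entries.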
• Write operation
• Normal mode: unbuffered writes
• Fast sync mode: all writes to the commit log and the data file are buffered in memory
So, write operations are again of two types. We can have a normal write, which is unbuffered,
meaning that we persist directly, or we can have a fast sync mode. In fast sync mode, all the
writes to the commit log and the data file are essentially buffered in memory. This makes the
writes fast, but just in case there is a crash, we will lose data. So, that is a problem.


So, we make sequential writes to the disk, which of course is a fast operation. Unlike databases,
in this case our data structures are very simple; they are essentially like linked lists, so we do
not have costly B-tree updates. I need to make a small correction here: the crash in question is
not a disk crash, it is a machine crash. A machine crash is something that wipes off the state of
main memory. If the machine has crashed, its main memory is gone, but the data on the disk still
remains. And even if you cannot get the machine up, you can always move the disk from one machine
to another.
Cassandra Index

• All the data is indexed based on the primary key
• The data file on the disk is broken down into a sequence of blocks. Each block contains at most
  128 keys.
• A block is demarcated by the block index, which captures the relative offset of a key within a
  block, and contains the size of its data
So, let us now look at the structure of a Cassandra index. All the data, of course, is indexed
based on the primary key. The primary key is the index of the row, and within the row, as I said,
we have columns, families of columns, and super columns. The data file on the disk is broken down
into a sequence of blocks, and each block has at most 128 keys.


A block, of course, is demarcated by the block index, which captures the relative offset of a key
within a block and contains the size of its data. So, the search process looks like this: we first
search the index in memory. If it is not in memory, we search the index on the disk.


So, given a key, we can quickly locate it within the index. Using this structure, we can find the
block that contains the key, and from there, we locate the value of the key; every key has a
pointer to its value. The value is the location where the data for all of its columns is stored.
Again, we first search for the value in the in-memory data structure, and if we do not find it, we
then search for it in the data file. Of course, most indices are stored in memory for quick access,
because we would like to minimize disk accesses as far as possible.
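The two-level lookup (first find the block, then find the key within the block) can be sketched in Python as follows. The names build_blocks and lookup are hypothetical, and the blocks here are in-memory lists rather than byte ranges in a data file.

```python
import bisect

BLOCK_SIZE = 128  # at most 128 keys per block, as stated above


def build_blocks(sorted_items):
    """Split sorted (key, value) pairs into blocks; return (blocks, first_keys)."""
    blocks = [sorted_items[i:i + BLOCK_SIZE]
              for i in range(0, len(sorted_items), BLOCK_SIZE)]
    first_keys = [block[0][0] for block in blocks]
    return blocks, first_keys


def lookup(key, blocks, first_keys):
    # Step 1: find the last block whose first key is <= key.
    i = bisect.bisect_right(first_keys, key) - 1
    if i < 0:
        return None
    # Step 2: find the key's position within that block.
    block_keys = [k for k, _ in blocks[i]]
    j = bisect.bisect_left(block_keys, key)
    if j < len(block_keys) and block_keys[j] == key:
        return blocks[i][j][1]
    return None
```

Only the small first_keys list needs to stay in memory to narrow a search down to one block, which is the reason for keeping indices in memory while the bulk of the data stays on disk.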
Arrangement of Files

• Files are ordered according to their time of creation.
• We start by accessing the latest file (since we want the latest update)
• Because of temporal locality of accesses, we keep recent data in small data files
• Gradually we merge them and create bigger merged files


As for the arrangement of files, the files are ordered according to their time of creation. We
traverse the files from the newest to the oldest. Since we want the latest update, we access the
latest file first, and because of temporal locality of accesses, as I said, we keep recent data in
small files. Gradually, we keep on merging the older files to create bigger and bigger files, which
of course take more time to access. But we are also assuming that accesses to such files will be
infrequent.
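The merging step can be illustrated with a short Python function. The representation is an assumption made for the sketch: each file is a dictionary from key to a (timestamp, value) pair, and when the same key appears in several files, the entry with the latest timestamp is kept.

```python
def merge_files(files):
    """files: list of dicts mapping key -> (timestamp, value).

    Returns one merged dict, sorted by key, holding the latest version of each key.
    """
    merged = {}
    for data_file in files:
        for key, (ts, value) in data_file.items():
            if key not in merged or ts >= merged[key][0]:
                merged[key] = (ts, value)
    return dict(sorted(merged.items()))
```

After a merge, a read that has to go to old data touches one large file instead of many small ones, at the cost of that file being slower to search.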
So, what are the lessons learned by the developers? They used MapReduce jobs to process all the raw
data from a MySQL database and send it to the Cassandra instance. They did implement atomic
operations per key per replica; that was not hard to implement. The average failure detection time
was actually high: it was of the order of several minutes with a regular failure detector.


But with the accrual-based failure detection mechanism, they were able to bring it down to 15
seconds. And the usage of ZooKeeper as a coordination service was very useful. So, in general, in a
distributed system, at least for leader election, replica management, and load balancing (let us
write it down), you will need some kind of a distributed coordination service.
• Maintains a per user index of all messages sent and received
• Search features
  (a) Term search
  (b) Search by the name of the recipient
• For (a): User id is the key and the words in the messages are the super columns. Message ids are
  columns within the super column
• For (b): User id is the key and the recipient's ids are the super columns. For each super column,
  the message id is the column
• Uses index prefetching


As for the inbox search, we maintain a per-user index of the messages sent and received. As we had
said in the first slide, we can search it based on terms and based on the name of the recipient. If
you are doing a term-based search, the user id is the key, and the words in the messages are the
super columns.

So, the different message words are the super columns, and for each super column, the ids of the
messages that contain that term are the columns. Similarly, if, let us say, I were to search by
recipient, the key would again be the user id, and the recipients' ids are the super columns.


And for each super column, the message ids are the columns. And to speed up the execution, 
we use index prefetching, which means that we also prefetch the indexes and keep them in 
memory for very, very fast access, because we need to locate the key and given the key we 


need to quickly fetch its value. 
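The two index layouts can be mimicked with nested dictionaries in Python, where the outer key plays the role of the row key, the middle key the super column, and the innermost set the columns. The function name build_indexes and the message tuple layout are assumptions for this sketch.

```python
from collections import defaultdict


def build_indexes(messages):
    """messages: list of (user_id, message_id, recipient_id, text) tuples."""
    # Term search: user id -> word (super column) -> message ids (columns).
    term_index = defaultdict(lambda: defaultdict(set))
    # Recipient search: user id -> recipient id (super column) -> message ids (columns).
    recipient_index = defaultdict(lambda: defaultdict(set))
    for user_id, message_id, recipient_id, text in messages:
        for word in text.lower().split():
            term_index[user_id][word].add(message_id)
        recipient_index[user_id][recipient_id].add(message_id)
    return term_index, recipient_index
```

A term search for "lunch" by user u1 is then term_index["u1"]["lunch"], and a search for everything u1 exchanged with u2 is recipient_index["u1"]["u2"]; in both cases a single row key leads directly to the message ids.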
So, this was a 2010 paper published by two Facebook engineers. The Cassandra system has since been
used very heavily. It is also very widely documented, in different papers and in the technical
press. So, I would ask the viewers to look at other sources of information and documentation on
Cassandra.
So, in this lecture, we will discuss Facebook's photo storage. The title of the paper is about
finding a needle in a haystack. A haystack is essentially a large stack of hay, and a needle is
very small and very hard to find if it is inside a haystack. In this case, the haystack is the
photo storage and the needle is a single photo, which, needless to say, is very hard to find.
Facebook Photo Storage

• In 2010, Facebook had 260 billion images
• Users upload one billion photos (60 TB) in one week
• Haystack (new and improved approach)
So, we will discuss an overview of the approach, and then the design and the evaluation. This paper
is around 10 years old, so it is slightly dated. Even as of 2010, Facebook was very big. It had 260
billion images, which was very, very large for that time, and is large even now. Users would upload
a billion photos a week, which would be roughly of the order of 60 terabytes. The Haystack system
was a new and improved approach for storing them.


This approach replaced the traditional approach, which used the network file system, NFS. It
reduced the number of disk accesses, because the main problem with a traditional file system like
NFS was the large number of disk accesses. So, this is something that the designers explicitly
aimed to reduce.


The other aim was to minimize per-photo metadata. We keep the data for the photo as well as some
metadata to identify it. The main things that the metadata contains are the size and the status of
the photo, and of course, something like its starting position on the disk. We will come to what
the status is later in the lecture. All three of these, the size, the status, and the starting
position of a photo, form the metadata, and its size is what needs to be minimized.
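To see why such small metadata is enough, consider the following Python sketch of a photo store that keeps all photos in one large file. Everything here is an illustrative assumption: the class PhotoStore is hypothetical, an in-memory byte buffer stands in for the file on disk, and the status is modeled as a simple "deleted" flag.

```python
import io


class PhotoStore:
    def __init__(self):
        self.volume = io.BytesIO()  # stands in for one large file on disk
        self.index = {}             # photo id -> [starting position, size, deleted flag]

    def put(self, photo_id, data):
        """Append the photo to the end of the volume and record where it is."""
        offset = self.volume.seek(0, io.SEEK_END)
        self.volume.write(data)
        self.index[photo_id] = [offset, len(data), False]

    def get(self, photo_id):
        entry = self.index.get(photo_id)
        if entry is None or entry[2]:
            return None
        offset, size, _ = entry
        self.volume.seek(offset)    # one seek and one read; no directory traversal
        return self.volume.read(size)

    def delete(self, photo_id):
        """Only the status changes; the photo's bytes stay where they are."""
        if photo_id in self.index:
            self.index[photo_id][2] = True
```

With the index held in memory, fetching a photo needs a single access to the volume, which is the contrast with a general-purpose file system that first has to resolve directories and inodes.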


Also, we need to serve a million images per second, which was Facebook's peak serving rate in those
days. That is the reason they had to create a novel, bespoke solution: a photo object store,
essentially a store only for photos. This was known as Haystack, and Haystack, of course, is in
line with the things that we have been studying till now. We have looked at a few projects that
essentially use the same elements.


So, we have looked at Google Percolator, which is built on Bigtable. Essentially, there is a
distributed file system and there is a DHT in some form. You have also seen Coda, which is a
distributed file system. Then we discussed Dynamo, which is, again, a one-hop DHT, and then
Cassandra, which was for the inbox search problem. Of course, in inbox search, we are looking at
small messages, not large photos. So, if a viewer were to ask why we cannot use Cassandra to store
photos, well, Cassandra is not meant for that. The main aim of Facebook's Cassandra was to store
small messages, not large photographs; in this case, storing photographs is the aim.
• Pattern: Written once, never modified, rarely deleted


So, Facebook saves each photo in 4 formats: large, medium, small, and thumbnail. Essentially, every
photo that we upload is stored in these 4 formats. And let us say we apply some sort of a
transformation to a photo, like rotating it; then essentially, we create a new photo.


So, similar to our previous assumptions, we assume that data is immutable. Immutable means that it
does not change, which means that we write once and read many times. If I were to create a new
version of a photo, say by rotating it, that is essentially like uploading a new photo. So, the
pattern is the same: a photo is written once and never modified. Also, it is rarely deleted, in the
sense that we very rarely delete photos from Facebook. Unless a photo has been uploaded in error,
we typically do not delete photos.


So, consider any POSIX file system. What is a POSIX file system? Well, the standard Linux file
systems, such as ext3 and ext4, are POSIX-compliant file systems. They have a lot of overheads.
That is okay for a general-purpose file system, but it is not the best solution for storing one
specific type of data.


So, what adds to the overhead? Well, we have directories; the file system supports a deeply
hierarchical structure, and that adds to the overhead.

The further thing is that we store a lot of additional data, like permissions and so on, which is
not required. For storing photos, we do not require directories and we do not require permissions.
There are also problems with traditional NFS, which is in a sense a distributed version of a POSIX
file system: we need several disk accesses to read a file, not one. Given the file name, we have to
retrieve the inode number by traversing the directory tree, and then read the data. Even if we do
caching, it does not help a lot in reducing disk accesses.
So, the requirements that we have for a photo object store are high throughput and low latency.
It is not high throughput and high latency, as was the case in other approaches that were more
like batch processing. In those cases you need high throughput, but you typically do not care how
long each request takes. That is not the aim here. The aim here is a very high throughput and a
low latency, and for that we have to create a tiered system that allows this.


So, first, we define a certain processing capacity, which means a certain quality of service for a
server: essentially, the maximum number of requests that can be served per cycle. If you are
exceeding this, you can drop requests, which is not a good idea. Or what we can do is have the
Facebook system along with a content distribution network (CDN), something like Corona or Akamai.
So, the CDN is a set of third-party servers that buffers some of the popular Facebook data and
serves it to the users. In other words, the CDN acts like a cache for Facebook. Whatever is
popular can be served by it. But here is the main problem, and you would also see this in the
paper: if I were to plot the popularity on the y axis and order the Facebook pages by popularity
on the x axis, the graph would show a small number of very popular pages followed by a very long
tail.


So, you will have a set of very popular Facebook pages, mainly those of celebrities and
politicians. But most people are normal users, and mind you, this is a massive tail: you really
have a long, heavy tail of regular users whose photographs will not be extremely popular. As a
result, the caching strategy of the CDN will not really help. It will help for the celebrity
part, but for the regular users we need to use the default Facebook system.
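
To get a feel for why a heavy tail hurts a cache, here is a minimal sketch in Python. It assumes,
purely for illustration, that photo popularity follows a Zipf-like law (the photo at rank r is
requested in proportion to 1/r^s); the function name, the exponent, and the photo counts are
assumptions and are not taken from the paper.

```python
def top_k_hit_rate(num_items: int, cached: int, exponent: float = 1.0) -> float:
    """Fraction of requests served by a cache that holds the `cached` most popular items.

    Assumes a Zipf-like popularity law: the item at rank r gets weight 1 / r**exponent.
    """
    weights = [1.0 / rank ** exponent for rank in range(1, num_items + 1)]
    return sum(weights[:cached]) / sum(weights)


if __name__ == "__main__":
    total_photos = 1_000_000  # hypothetical number of photos
    for fraction in (0.001, 0.01, 0.1):
        k = int(total_photos * fraction)
        rate = top_k_hit_rate(total_photos, k)
        print(f"caching the top {fraction:.1%} of photos -> hit rate {rate:.2f}")
```

Under this assumption, the requests that miss the cache are the ones in the tail, and those are
the ones that the backing store has to serve.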


So, this is the two-layered structure of Facebook and the CDN, where the CDN is a third party,
as I said, like Corona or Akamai, which serves the more popular Facebook pages. But we need
Facebook's own haystack system for the rest of the requests, and they have to be served with a
high throughput and a low latency.


We have some additional requirements as well. We want only one disk operation per read, not more
than one. And another recurrent pattern that we have been seeing is that all the metadata of a
file needs to be stored in main memory, so that the metadata, at least, can be accessed very
quickly. So, the metadata gets stored in a main-memory cache and the rest on disk, and for the
disk we are allowed only a single access.


Fault tolerance. Well, this is also a recurrent pattern that we have seen. We saw it in Amazon
Dynamo, where the replication was across data centers. This is required because if one data
center fails, we would like another data center to take over. Furthermore, even within a data
center, we would like to have replication across racks, so that if one set of racks fails, another
set of racks can take over.
So, with all of these modifications, haystack registers a 4X throughput improvement over baseline
NFS. And the cost per terabyte of data served is 28 percent less, which is substantial for a
server farm. So, now we need to understand where these numbers come from and what exactly
Facebook did to get such a large improvement in throughput, simultaneously at a lower cost.
So, this is the typical flow of a simplified Facebook system. In this case, the web browser
requests the web server for a photo, or for a set of photos on a page. The web server replies, but
it does not reply with the photo; it replies with a kind of internal Facebook URL. This internal
URL is sent to the CDN, which acts as a cache for the photo storage, that is, for haystack. If
the CDN has the data, it returns it; otherwise, it forwards the request to the photo storage, which
then retrieves the photo from the haystack storage and returns it via the CDN. Some of the
popular photos may be cached by the CDN for later users.
So, let us now look at the default approach that uses a distributed file system, NFS. As we have
discussed, using CDNs (content distribution networks) alone is not an effective solution. The
primary reason is that the popularity distribution of requests has a very long tail, and the CDN
caches only the most popular photos. So, most requests are anyway sent back to the backing photo
store, which in this case is haystack.


Furthermore, what NFS would do is save each photo as a file on a commercial NAS appliance. A NAS
appliance, or network-attached storage appliance, is one where the storage device is connected to
the network and is accessible over it. So, in this case, given the URL, it is mapped to some
storage volume and to the path of the file, and the data is read from that path. Of course, we
can do an implementation where we save only hundreds of files per directory, to keep the directory
lookups cheap. This would still require 3 disk accesses: read the directory metadata, which is the
first access; read the inode of the file; and read the file contents.

So, one optimization is that we can cache file handles. But caching file handles does not really
help because of the heavy-tailed nature of the requests: for the tail, which is very, very long,
caching does not work.
There are a few competing technologies, like MySQL. Well, MySQL is meant for relational data; it
is a relational database. The data that we are looking at here is not of a relational character;
hence, it is not useful. We have also looked at the combination of GFS and Bigtable in Google
Percolator.


So, in GFS, we divide a file into chunks, replicate the chunks, and store them on different
servers. But in this case, we are dealing with data at the granularity of photos. So, this is not
a very scalable approach; plus, GFS tries to give us a general file system, which is not what we
want.


And Bigtable is, again, built on top of GFS; it takes a large multi-dimensional table, splits it
into small chunks, and replicates those chunks. Again, this is not the best solution for us,
because our data is not a multi-dimensional table. And we have already looked at the shortcomings
of traditional NFS on NAS storage devices.


So, what we actually want to do is create a new kind of file organization that uses elements that
we have seen in the past. We will store all the metadata of a file, which includes its position on
the disk, its size, and its status flag, in some easily searchable form, an index, in RAM. The
actual photograph will be stored on disk.
So, of course, we would like the RAM-to-disk ratio to be as high as possible, in the sense that
the more RAM, the better. But there are substantial cost implications with a larger RAM,
particularly in a very large data center. So, the aim here is to find the right balance, such
that we minimize the cost of creating the data center and at the same time maximize the
RAM-to-disk ratio. And, as we have been discussing several times, the problem of serving users'
Facebook pages cannot be outsourced to CDNs. For the Facebook pages of celebrities, yes, but not
for regular users, because of the heavy-tail pattern.
So, the architecture of haystack has 3 components: a store, a directory, and a cache. We will
discuss them in turn. The store is the actual photo store that stores all the photographs. The
entire photo store, which is of course a distributed store, is grouped into logical volumes.
Think of a logical volume as a large distributed directory.

Each logical volume, let me call it an LV, will be stored on several servers. Each replica on a
server will be called a physical volume.


And needless to say, a single server will never have two replicas of the same logical volume,
because then the aims of redundancy are not really being served.

We then have the haystack directory, which exercises the overall control in this process. The
haystack directory does a logical-to-physical mapping: given the logical volume ID, it gives you
the physical volume ID, which essentially includes the ID of the machine that stores the physical
volume. As for how the server is specified, an IP address is the easiest.


Second, given the photograph, it maps it to the logical volume. So, given an identifier for the
photograph, it maps it to the logical volume, further maps that to the physical volume, and from
the physical volume we find the machine that stores it. What the machine actually gets is a tuple
of the photograph and the logical volume, because the machine might be storing multiple physical
volumes, one corresponding to each different logical volume. So, this tuple is given to the
machine such that it can find the photograph within it.
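
The two mappings kept by the directory can be sketched as two dictionaries. This is only an
illustrative sketch in Python: the class name, the field names, and the addresses are
hypothetical, and the real directory is of course a replicated service and not an in-process
object.

```python
from dataclasses import dataclass, field


@dataclass
class DirectorySketch:
    # photo ID -> logical volume ID
    photo_to_lv: dict[int, int] = field(default_factory=dict)
    # logical volume ID -> machines that hold a physical volume (replica) of it
    lv_to_machines: dict[int, list[str]] = field(default_factory=dict)

    def locate(self, photo_id: int) -> tuple[list[str], tuple[int, int]]:
        """Return the candidate machines and the (logical volume, photo) tuple to send them."""
        lv = self.photo_to_lv[photo_id]
        machines = self.lv_to_machines[lv]
        return machines, (lv, photo_id)


directory = DirectorySketch(
    photo_to_lv={42: 7},
    lv_to_machines={7: ["10.0.0.1", "10.0.0.2"]},  # hypothetical IP addresses
)
machines, request_tuple = directory.locate(42)
```

Any one of the returned machines can serve the read, and the tuple tells that machine which of
its physical volumes to look in.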


And what is the cache? Well, the cache is a DHT, a regular DHT of the kind that we have seen in
the past, something like Pastry, Chord, Dynamo, etc. It is an internal CDN of sorts, which can
cache data and give it to you. But of course, it is not a small in-machine cache; it is a large
distributed cache, and this large distributed cache is organized as a DHT.
So, let us now take the previous figure that we have shown and add some additional details to
it. We open a Facebook page on a web browser, and the web browser sends the request to the web
server. The web server then sends the request to the haystack directory.


So, what does the request have? Well, the request has the identifier of either one photo or
multiple photos. The first decision that the haystack directory takes is whether it should direct
the user to the CDN or directly to the haystack cache. A connection of this nature exists, where
the user can be directed straight to the haystack cache, bypassing the CDN.


And second, given a photo, it does the mapping from photo to logical volume to physical volume to
the machine where the data physically is. This is of course not done by the haystack cache; the
mapping is done by the haystack directory, and the result of the mapping is what it returns to
the user, along with the choice of where the request should go first. So, these 3 things will get
returned to the user.


So, user means the user's browser, and the user's browser can then send the request to the CDN.
If the CDN has the photo, then it will of course send it back; otherwise, the CDN will send the
request to the haystack cache. The haystack cache, as you can see from the structure, is
organized like a DHT.


And then if the haystack cache has the photo, well and good: the photo will be sent back to the
user. If it does not have it, it will send the request to the haystack store. The haystack store
will then search for the photo within the physical machine, and once it gets the photo, it will
return the photo to the haystack cache. The haystack cache will of course store it, so this at
least takes care of some temporal locality. Then we have an option that depends on how the
directory originally configured the request. The first option is that the photo can be sent
directly back to the user's browser.


And the other option is that it can be sent to the CDN for further caching, which is what we show
in this figure. After caching it, the CDN can return the result to the browser. So, where the
request actually goes first is completely a function of how the request has been configured by
the haystack directory. You can think of the request-response cycle as having two distinct
phases. The first phase is where we talk to the directory and get an expanded URL for the photo.
Then, in the second phase, we make a fresh request via the CDN, the cache, and the store.
So, the web servers use the directory to create a URL for each photo; that is the job of the web
server. The URL that they create will be of this type: it will start with HTTP, because the HTTP
protocol is being used. The first entry will point to the CDN, so it will be the address of the
CDN. It points the user to the CDN, and then the user's browser can retrieve the photo from the
CDN.


So, what is being sent? Well, the logical volume and photo tuple is being sent, such that the CDN
gets the first chance. If the CDN does not have it, then the CDN automatically sends the request
to the haystack cache; if the cache has the (logical volume, photo) tuple, it will return the
photo to the user. If, again, the cache does not have it, the request is sent to the physical
volume, which is on a physical machine within the haystack store.
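
The way the URL encodes this chain of lookups can be sketched as follows. This is a minimal
Python sketch that assumes a URL of the form http://CDN/Cache/Machine id/Logical volume,Photo;
the function names and the host names are hypothetical and only serve the illustration.

```python
def build_photo_url(cdn: str, cache: str, machine_id: str,
                    logical_volume: int, photo_id: int) -> str:
    """Assemble the URL that the web server hands to the browser (assumed layout)."""
    return f"http://{cdn}/{cache}/{machine_id}/{logical_volume},{photo_id}"


def strip_first_hop(url: str) -> str:
    """On a miss, a tier drops its own address and forwards the rest to the next tier."""
    scheme, rest = url.split("://", 1)
    _this_tier, remainder = rest.split("/", 1)
    return f"{scheme}://{remainder}"


url = build_photo_url("cdn.example", "cache.example", "store17", 7, 42)
after_cdn_miss = strip_first_hop(url)                # request now addressed to the cache
after_cache_miss = strip_first_hop(after_cdn_miss)   # request now addressed to the store machine
```

Each tier only needs to look at the front of the URL, and the (logical volume, photo) tuple
travels unchanged all the way to the store machine.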


The physical machine will have it for sure. It will again take a look at the (logical volume,
photo) tuple. The logical volume, of course, is stored within the physical machine as a physical
volume. So it will look up the physical volume, find the photo, and return it. The upload process
is analogous: the user contacts the web server, and the web server contacts the haystack
directory.
So, in this case, since it is a fresh photo, the directory's job is to assign a writable logical
volume. Any writable logical volume can be assigned. The web server, on its part, then sends a
request to the haystack store, and the store will write the new photo to all the physical volumes
that store a replica of the logical volume.
So, let us now discuss the functionality of the haystack directory. As we have discussed before,
it provides a mapping from a logical volume to its physical volumes. This is like deciding which
replica we need to send a request to. Note that in this case, since this is read-only, immutable
data, we do not really need to read from a quorum of replicas, and there is no version
management.


So, the version management and so on that we had in Amazon Dynamo is not there; we can just send
the request to any one replica and read. And every single photo is protected by a checksum, so
there are internal mechanisms for finding out if bits or bytes have been corrupted. One of the
things that the haystack directory does is load balance the writes across all the logical volumes
that are there.


So, let us say that a write needs to be done. Of course, the write can be sent to any writable
logical volume because it is a fresh photo. Since the write can be sent to any logical volume, we
need to choose that logical volume. The directory does load balancing here, such that there is not
a lot of traffic on any one volume.


Also, it determines whether the request will be handled by the haystack cache or the CDN. If it is
a popular page, it can be handled by the CDN; else it is sent directly to the haystack cache, or
it might be sent to the haystack cache via the CDN. Also, it marks volumes as read-only once they
fill up. Every logical volume has a capacity, which is its maximum size. Once the logical volume
fills up, the entire logical volume is sealed and marked as read-only. This means that the
logical volume will not expand any further.


So, what we need to do is start more machines and create new logical volumes because, as I said,
every logical volume has a maximum size. This means that to create more writable volumes for new
photographs, we need to start more machines. Facebook's data center therefore has to expand
continuously, because what was found was that only roughly 20 to 25 percent of the photographs
are ever deleted.


So, in general, we do not delete photographs from our Facebook profile unless they have, of
course, been uploaded in error. Since most photographs are never deleted, Facebook's storage
requirement keeps on increasing. That is the reason we have a cap on the size of a logical
volume. Once we reach the cap, the volume is marked read-only, and we bring more machines into
our system.
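
The write path of the directory, namely choosing a writable volume and sealing a volume when it
fills up, can be sketched like this. It is a simplified Python sketch: the class names, the
random choice used for load balancing, and the tiny capacities are assumptions made for
illustration and are not the policy described in the paper.

```python
import random
from dataclasses import dataclass


@dataclass
class LogicalVolume:
    volume_id: int
    capacity_bytes: int
    used_bytes: int = 0
    read_only: bool = False


class WriteBalancerSketch:
    def __init__(self, volumes, seed=0):
        self.volumes = list(volumes)
        self.rng = random.Random(seed)  # seeded so that the sketch is reproducible

    def assign(self, photo_size: int) -> LogicalVolume:
        """Pick a writable volume for a new photo; seal the volume once it is full."""
        writable = [v for v in self.volumes
                    if not v.read_only and v.used_bytes + photo_size <= v.capacity_bytes]
        if not writable:
            # In the lecture's terms: start more machines and create new logical volumes.
            raise RuntimeError("no writable logical volume has room for this photo")
        volume = self.rng.choice(writable)  # spread the writes over the writable volumes
        volume.used_bytes += photo_size
        if volume.used_bytes >= volume.capacity_bytes:
            volume.read_only = True         # sealed: it will not grow any further
        return volume


balancer = WriteBalancerSketch([LogicalVolume(1, 100), LogicalVolume(2, 100)])
first = balancer.assign(100)   # fills one volume, which becomes read-only
second = balancer.assign(100)  # must go to the other volume
```

After these two writes both volumes are sealed, so a third call to `assign` raises an error,
which corresponds to the point where new machines have to be brought in.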
The haystack cache, well, it is very simple. It is organized as a regular DHT. The key is the
photo's ID and the value is the photo's data. Of course, if an item is not there, the request is
sent to the store. Here there are two interesting schemes that we should talk about. The haystack
cache can get a request either from the CDN or from the user's browser directly, depending upon
the way that the haystack directory has configured the request. The first scheme is that it caches
a photo only if the request is coming from the user directly, not from the CDN.


The reason is that the CDN itself is a cache. If the CDN does not have a photo, it will anyway get
it from the store and then cache it, so there is no reason for us to cache it twice. But if the
user is directly requesting the haystack cache, then we actually cache the photo. The second
scheme is that it does not cache photos from read-only logical volumes.


The reason is that a volume becomes read-only after a certain time. This can be roughly, let us
say, after 100 to 200 days; we configure the system such that around the time its photos stop
being used much, the logical volume fills up and becomes read-only. At that point, what the
engineers of Facebook saw is that we do not need quick access to these photos, because users very
rarely look at photos that are more than 3 months or 6 months old.
So, they are still a part of your Facebook profile, but they are not accessed that frequently.
That is the reason there is no point in caching those photos. Hence, when a volume becomes
read-only, we do not cache its photos. Only for write-enabled volumes, which in a sense hold more
recent photos, do we cache them in the haystack cache.
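
These two caching rules can be written down as a small admission check. The following is a
minimal Python sketch; the class name, the method signature, and the use of a plain dictionary
in place of a DHT are all assumptions made for illustration.

```python
class HaystackCacheSketch:
    def __init__(self, fetch_from_store):
        self._data = {}                 # photo ID -> photo bytes (stands in for the DHT)
        self._fetch = fetch_from_store  # callable used on a miss

    def __contains__(self, photo_id):
        return photo_id in self._data

    def get(self, photo_id, *, from_cdn: bool, volume_read_only: bool) -> bytes:
        if photo_id in self._data:
            return self._data[photo_id]
        photo = self._fetch(photo_id)
        # Rule 1: do not cache if the CDN asked, because the CDN will cache it itself.
        # Rule 2: do not cache photos from read-only (older) volumes.
        if not from_cdn and not volume_read_only:
            self._data[photo_id] = photo
        return photo


store_reads = []

def read_from_store(photo_id):
    store_reads.append(photo_id)        # record every access that reaches the store
    return b"photo-%d" % photo_id

cache = HaystackCacheSketch(read_from_store)
cache.get(1, from_cdn=True, volume_read_only=False)    # not cached: came via the CDN
cache.get(2, from_cdn=False, volume_read_only=True)    # not cached: read-only volume
cache.get(3, from_cdn=False, volume_read_only=False)   # cached
cache.get(3, from_cdn=False, volume_read_only=False)   # served from the cache
```

Only the third photo ends up in the cache, so its second request never reaches the store.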
Let us now come to the haystack store, which is the third component of our discussion. Each store
machine manages multiple physical volumes, where each physical volume corresponds to a different
logical volume. Here is one of the key innovations of the Facebook engineers: they designed the
physical volume to be a single very large file, not multiple files. Otherwise, we would have had
to do a directory access for each photo to get its details. It is a single, very large file.


So, the large file is organized as a stack. We have the data for one photo, then the data for the
next photo, and so on and so forth. Essentially, we just concatenate the data for the photos in
the large file. It is never the case that we delete a photo, leave a hole, and then try to put
something else in it; that is not done. We essentially have immutable data, data that does not
change. And like a regular stack, we just keep on adding photos to the end of this very large
file.
So, for accessing a photo on a machine, what we need is the logical volume ID, because that is how
the machine maps the request to the physical volume, which is this huge file; the offset, which is
the index of the starting byte of the photo within this large file; and the size of the photo,
which tells us how much to read. We read only this part of the large file, and this becomes the
photo.


So, as you can see, the data organization is very simple. We have a large file, where the photos
are concatenated one after the other. The only things that we need to store are the ID of the
file, which comes from the logical volume ID; the offset, which is the starting position; and the
size of the photo, which gives the end position within this file. Furthermore, the store machine,
the machine that stores these volumes, keeps an in-memory mapping from photo IDs to this
metadata.


So, from every photo ID to this metadata there is a mapping, which is stored in memory so that it
is quickly accessible. This is the same pattern that we have been seeing, where we keep the
mapping in memory, and the actual data, some of it in memory, most of it on the disk.
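
The idea of one large append-only file plus an in-memory map from photo ID to (offset, size) can
be sketched in a few lines of Python. The class name and the file name are hypothetical, and
the sketch leaves out everything else that a real volume file carries, such as the superblock and
the per-photo header fields.

```python
import os
import tempfile


class VolumeSketch:
    """One physical volume: a single large file that only ever grows at the end."""

    def __init__(self, path):
        self._file = open(path, "a+b")   # append mode: every write goes to the end
        self._index = {}                 # photo ID -> (offset, size), kept in memory

    def append(self, photo_id: int, data: bytes) -> None:
        self._file.seek(0, os.SEEK_END)
        offset = self._file.tell()       # the photo starts where the file currently ends
        self._file.write(data)
        self._file.flush()
        self._index[photo_id] = (offset, len(data))

    def read(self, photo_id: int) -> bytes:
        offset, size = self._index[photo_id]   # memory lookup, no disk access
        self._file.seek(offset)                # one seek ...
        return self._file.read(size)           # ... and one read of exactly `size` bytes

    def close(self) -> None:
        self._file.close()


with tempfile.TemporaryDirectory() as tmp:
    volume = VolumeSketch(os.path.join(tmp, "volume_7.dat"))
    volume.append(1, b"aaaa")
    volume.append(2, b"bbbbbb")
    first_photo = volume.read(1)
    volume.append(3, b"cc")
    location_of_third = volume._index[3]
    volume.close()
```

The read path touches the disk only once per photo, which is the requirement that was stated
earlier.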
Now, let us come to the file structure. As we have discussed, the physical volume is a huge file
that grows like a stack. It starts with something called the superblock, which has all the meta
information regarding the volume, followed by a sequence of needles, where a needle is essentially
one photo. Each needle has the following fields.


It has a header, which identifies the name of the photo, its type, and so on. Then it has an
important field known as a cookie. The cookie is a random number, and the aim of the cookie is to
defeat a certain kind of attack. Imagine I have my Facebook profile, and in that, let us say, my
photos are named www.facebook.com/srsarangi/1.jpg, 2.jpg, 3.jpg. It will become very easy to guess
the IDs of photos, and I can start doing bulk downloads. If there is a little bit of laxity in
security, I can even guess the URLs of the photos of other users and download them. We want to
defeat that.


So, a cookie is a large number, a 32 to 64 bit random number that is associated with each photo.
To get access to a photo, we need to know its URL and also supply this random cookie, which you
can think of as a password for the photo; only if the cookie matches is the photo sent back. So,
this makes it hard to guess photo URLs.


For every photo that we fetch, the cookie has to be supplied, and the browser will not have the
cookie for a photo unless we are a legitimate user and the photo is being accessed in a
legitimate manner. Internally, every photo is identified with a key and an alternate key.


Together, the key and the alternate key take 96 bits, and they identify a given photo. Then we
have the flags field, which is very interesting. We have till now maintained that a photo is not
deleted, and that is the reason we have a list that grows forever in one direction.


But in practice, we can delete photos from Facebook, even though it is rare. We can even delete
entire albums. So, in this case, we do need some mechanism to delete, and the mechanism is very
simple: we keep a single bit, the delete bit. If it is set to 1, the photo is deleted; if it is
set to 0, the photo is fine.
So, what we do is keep a single status bit within the flags field of the needle. To delete a
photo, well, it is as simple as setting this bit to 1, which means that the next time the haystack
system sees the flags, it will see the delete bit and conclude that the needle has been deleted.


Then come the regular fields: the size field, which is the number of bytes of data; the data, the
actual data of the photo; and a checksum to verify whether a single bit, a single byte, or a few
bytes have been corrupted, which allows error detection and recovery. The mapping between the
photo ID and the needle's fields, the offset, the size, etcetera, is kept in memory. So, in memory
we have an index from the photo ID to the corresponding needle. And, as mentioned, the aim of the
cookie field is to make it hard to guess the URL of a photo.
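
To see how these fields fit together, here is a minimal Python sketch that packs a needle into
bytes and checks the cookie, the delete bit, and the checksum on a read. The exact byte layout,
the field widths chosen here (64-bit cookie, 64-bit key, 32-bit alternate key, 1-byte flags), the
use of CRC-32 as the checksum, and all the names are assumptions made for illustration.

```python
import struct
import zlib

# Assumed layout: cookie, key, alternate key, flags, data size (little-endian, no padding).
HEADER = struct.Struct("<QQIBI")
FLAG_DELETED = 0x01


def pack_needle(cookie: int, key: int, alt_key: int, data: bytes, deleted: bool = False) -> bytes:
    flags = FLAG_DELETED if deleted else 0
    header = HEADER.pack(cookie, key, alt_key, flags, len(data))
    checksum = struct.pack("<I", zlib.crc32(data))   # checksum over the photo data
    return header + data + checksum


def read_needle(buf: bytes, expected_cookie: int) -> bytes:
    cookie, _key, _alt_key, flags, size = HEADER.unpack_from(buf, 0)
    data = buf[HEADER.size:HEADER.size + size]
    (checksum,) = struct.unpack_from("<I", buf, HEADER.size + size)
    if flags & FLAG_DELETED:
        raise LookupError("needle is marked as deleted")
    if cookie != expected_cookie:
        raise PermissionError("cookie does not match")   # guessed URLs fail here
    if zlib.crc32(data) != checksum:
        raise ValueError("checksum mismatch: data is corrupted")
    return data


needle = pack_needle(cookie=0xABCDEF, key=1, alt_key=2, data=b"jpeg-bytes")
photo = read_needle(needle, expected_cookie=0xABCDEF)
```

A request with the wrong cookie is rejected even though the key and the alternate key are
correct, which is exactly the protection against guessing that was described above.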
So, now let us come to the write and delete operations. For the write operation, we provide the
logical volume ID, which of course we get from the haystack directory; the key and the alternate
key; the cookie, which is a random number; and the data of the photo. Once we have all of this, we
go for the write: each physical machine updates its in-memory metadata index, creates a needle,
and writes the data.
So, each machine that stores the associated physical volume independently updates its in-memory
metadata index, places the needle, and writes the data. Since a photo is never modified, if we
remove red-eye or rotate the image, a new image is created and saved with the same key and
alternate key. We save it with the same pair of keys; it is just that the metadata stored in
memory, which was previously pointing to one needle, is changed to point to the new needle.

Photo delete, well, we have discussed this in the past: we set a single bit in the volume file and
in the in-memory data structure to indicate that the photo associated with the specific needle has
been deleted.
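
The write, overwrite, and delete behaviour can be sketched with an in-memory byte array standing
in for the volume file. This is a deliberately reduced Python sketch: each needle here is just a
1-byte flags field followed by the data, and the class and field names are hypothetical.

```python
class NeedleStoreSketch:
    def __init__(self):
        self.volume = bytearray()   # stands in for the large volume file
        self.index = {}             # (key, alt_key) -> [offset, size, deleted]

    def write(self, key: int, alt_key: int, data: bytes) -> None:
        offset = len(self.volume)               # always append at the end
        self.volume += b"\x00" + data           # flags byte (0 = live), then the data
        self.index[(key, alt_key)] = [offset, len(data), False]  # remap to the new needle

    def delete(self, key: int, alt_key: int) -> None:
        entry = self.index[(key, alt_key)]
        self.volume[entry[0]] = 0x01            # set the delete bit in the volume
        entry[2] = True                         # and in the in-memory structure

    def read(self, key: int, alt_key: int):
        offset, size, deleted = self.index[(key, alt_key)]
        if deleted:
            return None
        return bytes(self.volume[offset + 1:offset + 1 + size])


store = NeedleStoreSketch()
store.write(1, 0, b"old")
store.write(1, 0, b"rotated")        # same keys: the index now points to the new needle
current = store.read(1, 0)
old_bytes_still_on_disk = bytes(store.volume[1:4])
store.delete(1, 0)
after_delete = store.read(1, 0)
```

Note that the old needle is never overwritten; it simply becomes unreachable through the index,
and a deleted needle stays in the file with its delete bit set.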
So, let us now discuss the structure of the index file. The index file is used to create the
in-memory data structure. While rebooting a machine, the index file is either read or, if needed,
created. So, it works like a checkpoint of the in-memory data structure.


To build the in-memory data structure, we read the index file and map it into memory, and that
becomes the in-memory index. The index file has a similar organization: we have a superblock and,
again, a set of needles, where in this case a needle is essentially a pointer to the
corresponding needle in the data file.
So, the index file and the data file are updated asynchronously. It is not like we take a lock or
a semaphore and update both in tandem. So, the two of them might not be in sync all the time, and
the application is aware of that. Of course, the data file is primary and the index file is
secondary. That is kept in mind.


So, one thing that we do is that after a reboot, the store machine first runs a job to bring the
index file in sync with the data file. After that, it remains in sync for some time; gradually,
some amount of asynchrony creeps in, but then, when the load reduces, they are again brought in
sync.


So, we can say that the lag between the index file and the data file is variable. But either
periodically or after a reboot, we read the data file and ensure that all the entries that are not
there in the index are brought in. One advantage is that since the data file is updated
sequentially, just like a stack, and the index file is also updated in sequence, there is a
one-to-one mapping between the data file updates and the index file updates.


So, consider the index file updates that we could not do in a given timeframe. If we just queue
them, then they will also be in the same order. So, what we need to do is apply them to the index
file in the same order. Finding out how many outstanding updates are remaining is actually very
easy: you just need to check the size of the queue, or we just keep a count.


Say the data file has seen 200 updates and the index file has seen 190. Then we know that the 10
most recent updates to the data file are missing from the index file. This relies on the updates
being applied in order, and since the data structure is inherently sequential, that is easy to
guarantee and it simplifies the entire process.
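To make this bookkeeping concrete, here is a minimal Python sketch of a data file with a lagging index. The names (Store, append, sync_index) are made up for illustration and are not taken from the Haystack paper; the only assumption carried over from the discussion above is that both files are updated strictly in order.

```python
class Store:
    """Toy model of a store machine: an append-only data file plus an
    index file that is allowed to lag behind it."""

    def __init__(self):
        self.data_file = []    # (key, offset, size), in arrival order
        self.index_file = []   # (key, offset), may trail the data file
        self.next_offset = 0

    def append(self, key, size):
        # Writes always go to the data file first (the primary copy).
        self.data_file.append((key, self.next_offset, size))
        self.next_offset += size

    def lag(self):
        # Both files are sequential, so the lag is a difference of counts.
        return len(self.data_file) - len(self.index_file)

    def sync_index(self, limit=None):
        # Replay missing entries in data-file order; `limit` caps the batch.
        start = len(self.index_file)
        end = len(self.data_file)
        if limit is not None:
            end = min(end, start + limit)
        for key, offset, _size in self.data_file[start:end]:
            self.index_file.append((key, offset))
        return end - start


store = Store()
for key in range(200):
    store.append(key, size=100)
store.sync_index(limit=190)
print(store.lag())         # 10 outstanding index updates
print(store.sync_index())  # 10 entries replayed, as after a reboot
print(store.lag())         # 0
```

Because nothing is ever applied out of order, the lag can be computed from two counters and the catch-up job never has to search for what is missing.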

Store machines should use an underlying file system that allows a quick random seek within a large
file. Note that these are not sequential accesses; they are random seeks. The store machine uses
XFS, which is an extent-based Unix file system.


In a normal file system, we allocate space at the level of disk blocks, and a disk block could be
512 bytes, 1 KB or 2 KB, typically just a few KB. In the case of XFS, we allocate space in much
larger contiguous chunks called extents, maybe at the level of 1 GB, and create a block map for
each one. A block map is the index corresponding to the chunk that has been allocated.


So maybe we allocate a 1 GB chunk and create a small block map for it, which can be cached in main
memory. This means that we have all the indexes in memory. When the data file runs out of space,
we give it a new extent, let us say another 1 GB extent, and create a small block map for it, which
extends the index. This small block map stores pointers to all the needles in the newly allocated
storage.

This also allows efficient preallocation of files if we know their sizes before the data arrives.
If we have an approximate idea of the size of a photo, say more or less 100 KB, then we can
allocate 120 KB needles. There will be some internal fragmentation, but we can quickly reserve
space for photos and quickly insert them at the relevant needles. These are optimizations to the
baseline Facebook photo storage file system. They have not been discussed in great detail in the
paper.

Recovery from failures: a background task called pitchfork periodically checks the health of each
store machine. How does it do that? It attempts to read data from the store machine. If it finds a
problem, such as the disk not being accessible or a problem in the network, it marks the machine
as read-only, which means that new writes are not sent to that machine.

Then we try to fix the problem. If the machine is otherwise fine, which means that the problem can
be fixed either by rebooting or by some kind of configuration change, then once it is fixed we
start a bulk sync operation from another replica to synchronize both the data files and the index
files.
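The health check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `probe` is a hypothetical function standing in for the test read, and the machine names and the `read_only` flag are invented for the example.

```python
def pitchfork_pass(machines, probe):
    """One health-check pass over all store machines.

    `machines` maps a machine name to a state dict. `probe` is a
    hypothetical function that performs a test read and raises OSError
    on a disk or network problem."""
    for name, state in machines.items():
        try:
            probe(name)
        except OSError:
            state["read_only"] = True   # stop sending new writes here
    return [name for name, state in machines.items()
            if not state["read_only"]]


def fake_probe(name):
    # Pretend that the disk of store-2 has failed.
    if name == "store-2":
        raise OSError("disk not accessible")


machines = {f"store-{i}": {"read_only": False} for i in range(1, 4)}
writable = pitchfork_pass(machines, fake_probe)
print(writable)   # ['store-1', 'store-3']
```

The machine is not removed from the system; it simply stops receiving new writes until it has been repaired and resynchronized.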

Some optimizations are due here. The first is compaction: we reclaim the space occupied by deleted
or duplicate needles. How do we do that? We do not punch holes in an existing file. Instead, we
take the old volume file and copy all the valid entries into a new volume file. If a few of the
entries are invalid, it does not matter: we skip them, so the volume gets compacted on the fly.
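Compaction can be sketched as a single copy pass. The needle layout below (a dict with `key`, `deleted` and `data`) is an assumption made for the example, and so is the rule that the last needle written for a key supersedes earlier duplicates.

```python
def compact(old_volume):
    """Copy live needles into a new volume, skipping deleted needles
    and keeping only the most recent needle for a duplicated key."""
    latest = {}                              # key -> position of last needle
    for pos, needle in enumerate(old_volume):
        latest[needle["key"]] = pos

    new_volume = []
    for pos, needle in enumerate(old_volume):
        if needle["deleted"] or latest[needle["key"]] != pos:
            continue                         # invalid entry: do not copy
        new_volume.append(needle)
    return new_volume


old_volume = [
    {"key": 1, "deleted": False, "data": "photo-1"},
    {"key": 2, "deleted": True,  "data": "photo-2"},
    {"key": 1, "deleted": False, "data": "photo-1-rotated"},
    {"key": 3, "deleted": False, "data": "photo-3"},
]
new_volume = compact(old_volume)
print([n["data"] for n in new_volume])   # ['photo-1-rotated', 'photo-3']
```

The old file is never edited in place, which keeps every write sequential.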


Over a year, roughly 25% of photos get deleted, so this is the amount of space that compaction can
reclaim. Photos also get modified, because we make changes to them like rotating them, removing
red eye, applying filters and so on. One more thing that we can do for a deleted photo is that,
instead of keeping a flag bit, we set its offset to 0. This acts like a special code which tells
us that the photo has been deleted.


Mind you, one photo actually produces 4 photos: a large, a medium, a small and a thumbnail. For
each of them we spend roughly 10 bytes of main memory on metadata, so we have a total of 40 bytes
of metadata in main memory per uploaded photo. We also try to sequentialize the writes, because
traditional storage devices such as commodity hard disks perform much better on sequential
writes. Photos are grouped into albums, so we try to sequentialize the writes as much as
possible, which increases the performance at the server side.

If I were to plot the cumulative percentage of accesses on the y axis against the age of the photo
on the x axis, which is a graph that we have been referring to and which captures this kind of
locality behavior, the shape of the curve is something like $A(1 - e^{-\beta x})$, a general curve
of this type.


A curve of this type has a strictly defined asymptote. What we see is that 90% of the cumulative
accesses are to photos that are less than 600 days old. So 90% of accesses fall within roughly one
and a half years, which means that we hardly access old photos. And if you see the chart in the
paper, you will find that most accesses to a photo happen within its first 3 months or so; after
that, the access probability reduces to very low levels.

Now some statistics. The date of publication is 2010. In 2010, 120 million photos were uploaded
per day, and this means that 1.44 billion Haystack photos were written per day. How does this come
about? 120 million is 0.12 billion, and 0.12 × 12 = 1.44. The factor of 12 comes from two numbers.
First, there are 3 replicas of each logical volume, so we write each photo thrice. Second, every
photo is stored in 4 separate sizes: large, medium, small and thumbnail.
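The same arithmetic, written out as a short Python check using only the numbers quoted above:

```python
uploads_per_day = 120_000_000   # photos uploaded per day (2010 figure)
sizes_per_photo = 4             # large, medium, small, thumbnail
replicas = 3                    # replicas of each logical volume

haystack_writes_per_day = uploads_per_day * sizes_per_photo * replicas
print(haystack_writes_per_day)  # 1440000000, i.e. 1.44 billion
```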


That is the reason why, even though 120 million photos were uploaded per day, 1.44 billion
Haystack photos were written, and during this time period roughly 100 billion photos were viewed.
As for the view statistics, most of the views, 85 percent, are for small photos, and 10 percent
are for thumbnails. These are the ones that you see on a typical Facebook wall. Large photos,
which are the ones that we expand and look at, account for only 5 percent of the views. This gives
us a statistical picture of the workload that the photo store is internally optimized for.

Read and write operations: the majority of operations in a Facebook-like system are reads. One
case is that we are actively accessing a photo. The other is that if I just visit my friend's
profile, I get to see hundreds of photos even though I do not actually want to see them. The
Facebook page is full of photos: photos of friends, the many things they have been doing, photos
of friends' friends, all of that. I am not really interested in all of them. Nevertheless, I see
them.


Writes, that is uploads, are far fewer in comparison. As you have seen, an upload is also a more
expensive operation, because we need to assign it a logical volume, multiple writes have to be
done to the physical volumes, and for every photo we have to generate 4 versions. All of that
requires effort and time. Thankfully, writes are not as frequent as reads. And there are almost
no deletes: deletes are few and infrequent, so we can deal with them.


One interesting observation is that reads are much slower than writes. One of the reasons is that
writes are off the critical path: we can quickly write something to a cache and come back, and the
write gradually percolates deeper into the system. Reads, in comparison, are on the critical path:
we need to go directly to the machine and do the read, whereas writes can in many cases be cached
in memory.

So you can write to memory and say that the data will get persisted to disk later on. But for a
read, we actually have to go and perform the disk read to get the data. The read times are fairly
high, 10 milliseconds, but the write times are small, 1.5 milliseconds.

This paper was published at OSDI in 2010. Of course, Facebook itself has grown significantly since
then; Facebook is no longer the Facebook of 2010. I still remember that in 2010 a lot of people
were not yet using Facebook to that extent. Facebook was still coming up in those days.


But nowadays, of course, Facebook is ubiquitous. Almost everybody has a Facebook page, and a lot
of major updates are posted on Facebook pages. Academic institutions have their Facebook pages,
celebrities have them, politicians have them, and policies are discussed on Facebook. In fact, if
you are not on Facebook, it appears that you do not exist.


This was not the case 10 years ago, and yet even 10 years ago we were looking at this many photos,
a billion-plus writes per day. In the current regime, things are of course at least an order of
magnitude larger. But many of these things are not documented, so we will have to wait for
Facebook to publish more papers to get an idea of how they handle today's scale, and what
modifications they have had to make to handle data and traffic at today's scale.

In this lecture, we will discuss the Voldemort project. Voldemort is an open-source project, and
some important modifications were made to it to serve large-scale data at LinkedIn. LinkedIn, as
we all know, is a professional networking site. LinkedIn engineers worked on Voldemort, and
Voldemort was later open sourced. The important aspect of this lecture is not to discuss Voldemort
itself. Rather, it is to discuss the modifications made to Voldemort for LinkedIn's specific
needs.

We will first discuss the overview, then the design, the overall structure and the evaluation. In
this case, we will rely heavily on the Amazon Dynamo paper, since Voldemort builds on Amazon
Dynamo. So it is important that viewers take a look at the lecture on Amazon Dynamo, which is part
of this lecture series and is there in this playlist. It is a very important prerequisite, because
without understanding how Dynamo works, they will not be able to follow this paper.

Data Intensive Web Sites

Many data intensive sites contain the following features:
- People you may know ...
- Items you may like (recommendations)
- Relationships between pairs of people
- LinkedIn (135 million users) features many more such relationships

Three phases: data collection, processing, serving

Let me illustrate these features of LinkedIn by actually showing you my LinkedIn site. The site
that you see over here is my LinkedIn site, and that is me. LinkedIn, of course, does a lot of
things: we have some news over here regarding Coronavirus and so on, and all the posts on my site.


I would like to take you to My Network, and this is the way that my network looks now. One of the
key things that LinkedIn does is that it suggests a lot of people that I may know. This is
actually the way that I make connections.


It suggests a lot of people that I may possibly know, so I can look through them and make
connections. How does it know them? Well, it mines my behavior. From my behavior, it has some idea
of the kind of people that I like to connect with. For example, I work on cybersecurity, so it
shows me cybersecurity people. I also work in computer architecture, so it is showing me Intel's
page. Just look at how much LinkedIn tracks my activity. I read economic news quite a bit, so The
Economic Times shows up over here. I work on multicore computing, so multicore shows up over
here. I am from IIT Kharagpur, so IIT Kharagpur shows up over here. I have also mentored some
startups, so "I love startups" shows up over here. It is as if my entire personality is reflected
over here. To find all of this data, LinkedIn had to do a lot of data mining. So this is the
"people you may know" feature.

Another thing that LinkedIn does is something called collaborative filtering. Let me go to the
page of my friend Partha Dutta, who is also a very acclaimed researcher in distributed systems.
Incidentally, he also works at LinkedIn. The important thing is to look at the panel on the right,
"People also viewed".


This is telling me that people who have seen Partha Dutta's page have also viewed other similar
people. Many of these are his connections, so I might want to connect with them. This is
collaborative filtering. You can think of it this way: it is telling me that others who viewed
Partha Dutta also viewed these people, so I might want to view them too.

The key behind all of this is, of course, the Voldemort project, which is a distributed database.
It is a key-value store, similar to the other key-value stores that we have seen. It is open
source, so the source code of the project is available on GitHub and can be used. We can make
several configuration changes. Voldemort has in-memory caching, replication, partitioning, all of
that.


What we will see is how this was made LinkedIn friendly for the two things that I discussed.
First, how is my network managed, in the sense of how LinkedIn suggests that I connect with new
people? Second, how are my contacts managed: if I go to a contact, how does it show me the other
people that viewers of my contact have also viewed? Now that we have seen this, let me go back to
the slide that I was showing.

Let us restart the presentation. You have now seen that LinkedIn makes a lot of recommendations.
Amazon also does that: Amazon suggests items that I may buy, and LinkedIn suggests people that I
may connect with. It looks at relationships between pairs of people.


As of the date of publication of this paper, which was quite a while ago, LinkedIn had 135 million
users. Now, of course, it has many more. There are also many, many relationships among these
people, and these relationships have to be mined. So we have to collect the data and we have to
process it. All of this data about me has to be collected. As you just saw, everything that I do
on LinkedIn is tracked.

From that, LinkedIn has an idea of what I like and what I dislike. The fact that I like economic
news is something that maybe my wife does not know, but LinkedIn knows. This is the extent to
which online sites know about us, which is okay in the modern world. That is how they make
recommendations to me, which means that there is a huge amount of data collection, and all of this
data is processed.


And when I open the site, all of this data is served to me, and I can see it quickly. Of course,
these three tasks (collection, processing and serving) do not happen at the same time. They
happen at different points in time. That is the wonderful part of this, and that is why LinkedIn
is such a successful professional networking site. As you have just seen, I use it a lot, and I
love using it.


The reason is that a lot of information is given to me regarding what I should read, who I should
connect with, who their connections are, and what the relationships between those connections
are. I get to see all of that.

Voldemort is a key-value system. It is now open source. LinkedIn uses it, but Voldemort is used by
several other sites as well. It is used by eHarmony, which is a very popular matchmaking site in
the US. Nokia also used to use Voldemort; I am not sure whether it does right now. Voldemort is
paired with a Hadoop engine, which is a MapReduce engine, that processes all the data in
LinkedIn's data store and essentially creates a read-only database that Voldemort uses.

So what is the key idea? The key idea is that outside the Voldemort system we have a Hadoop system
that takes in all the information, processes it, makes it ready, and creates a read-only version
of the data. It gives this to Voldemort so that Voldemort can serve it to customers.


What we need is a quick update system, where Voldemort can quickly read whatever the Hadoop system
is generating for it. We also need load balancing: we do not want any server to get overloaded
with data, which has been a constant concern throughout the designs in this course.


Our last lecture was on Facebook's photo storage, where the data was mostly read-only. This data
is even more read-only, in the sense that users cannot modify it at all: I cannot go and modify
the list of recommendations that LinkedIn is making. In fact, nobody can; only LinkedIn can. So
all of this data is totally read-only.


What happens is that the Hadoop system produces a tranche of data, and the entire tranche is bulk
loaded into Voldemort. All of this data is read-only. We need to ensure that this bulk load is
much faster than existing solutions.


An existing solution, the default solution, is a MySQL database, and LinkedIn's system is about
twice as fast as MySQL. It loaded 4 terabytes of new data every day as of the date of publication
of the paper. Nowadays, it must be much more.

There are two MySQL solutions, because MySQL has two native data storage formats, MyISAM and
InnoDB. MyISAM is a compact on-disk data structure, and it creates the index after loading the
data file. Even though MyISAM is reasonably fast and can be used for this purpose, the main
problem is that it locks the complete table during the process of loading.


Locking the entire table causes an unnecessary disruption in the quality of service, and this is
clearly not desirable. The other option is InnoDB, which supports far more fine-grained row-level
locking.


So it is possible to lock only those rows that we are reading and updating. The main problem is
that loading is a rather slow process, and it also requires a lot of disk space. The main reason
is that since rows are accessed individually, we need a lot of disk space to store all the
indexing data structures.


So in MyISAM there is the overhead of locking, and in InnoDB there is the overhead of an excessive
slowdown because of indexing. Then Yahoo has the PNUTS system, where CPUs are shared between the
data loading modules and the data serving modules. Even though this sounds like a good idea, in
the sense that we can have a very flexible structure with some data loaders and some data servers,
it reduces the performance as a whole. The LinkedIn engineers did evaluate all of these solutions,
and then they came up with a system of their own.

Overall Structure

The overall structure is like this. A Voldemort cluster contains multiple nodes, and a physical
host can run multiple nodes. These are like virtual nodes in Chord or virtual nodes in Dynamo.
Each node has a given number of stores, or database tables, and each store holds some information.


What is this information? Let us say that for a given member ID, which is me, LinkedIn is
recommending a couple of LinkedIn groups. Then the store can hold the list of recommended group
IDs, and for each group ID it can also store a description. The attributes of a store include the
replication factor, that is, on how many physical machines the store is replicated, and the sizes
of the read and write quorums.


I am deliberately going fast over this, because we have discussed read quorums and write quorums
in great detail when we were discussing other projects like Percolator and Dynamo, where the basic
idea was introduced. Recall the condition R + W > N, where R is the size of the read quorum, W is
the size of the write quorum, and N is the number of replicas. When this condition holds, every
read quorum overlaps with every write quorum, so a read always reaches at least one replica that
has seen the latest write.
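The overlap property can be checked by brute force for small N. The following Python sketch enumerates every possible read quorum and write quorum; the function name is made up for illustration.

```python
from itertools import combinations


def quorums_overlap(n, r, w):
    """Return True if every read quorum of size r shares at least one
    replica with every write quorum of size w, out of n replicas."""
    replicas = range(n)
    return all(set(read_q) & set(write_q)
               for read_q in combinations(replicas, r)
               for write_q in combinations(replicas, w))


print(quorums_overlap(3, 2, 2))   # True:  R + W = 4 > 3
print(quorums_overlap(3, 1, 2))   # False: R + W = 3 is not > 3
```

For N = 3, choosing R = 2 and W = 2 guarantees overlap, whereas R = 1 and W = 2 allows a read to land on the one replica that missed the write.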


So what do we do? The baseline is that Voldemort uses a DHT similar to Dynamo; in almost all of
these systems a DHT is used. It is a Chord-like system, but it is a one-hop DHT. We have
replication in the sense that every key is assigned to N of its successors, subject to the
successors being on different machines.


This set of successors is called a preference list, which again harkens back to Dynamo, so I would
request viewers to take a look at Dynamo. And of course, R + W > N. For transferring data, the
data needs to be serialized, which means that any object has to be converted into equivalent
text. So we use XML or JSON. For transferring textual data, we also use compression. The storage
engine can be either Berkeley DB, which is optimized for small files, or a traditional MySQL
database.
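As a rough illustration of serialization followed by compression, here is a sketch that uses Python's standard json and zlib modules. This is not Voldemort's actual wire format; the record layout and the function names are assumptions made for the example.

```python
import json
import zlib


def encode(value):
    """Serialize an object to JSON text, then compress the text."""
    text = json.dumps(value, sort_keys=True)
    return zlib.compress(text.encode("utf-8"))


def decode(blob):
    """Reverse of encode: decompress, then parse the JSON text."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical record: recommended group IDs for one member.
record = {"member_id": 42, "recommended_groups": [7, 11, 13]}
blob = encode(record)
print(decode(blob) == record)   # True
```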

The idea here is also similar to Dynamo in that we have a client API. How does this work? We have
these Voldemort clusters within LinkedIn. The client is not really the browser; the client is
essentially another LinkedIn node, the one that is rendering a page for me.


The client API sends a request through a network client to a network server, which accesses the
storage engine and returns all the versions in serialized form. Of course, the client needs to do
conflict resolution if there are multiple versions. We discussed conflict resolution in great
detail in the lectures on Coda and Dynamo, so I am not going over it again.

Let us now look at routing. The routing module deals with two things, partitioning and
replication. Partitioning is the way in which we partition the hash space and divide it among the
nodes. Replication is how a given piece of data is replicated among the nodes on the DHT. If you
recall, we had a scheme in Dynamo where we assigned tokens and split the hash ring into
equal-sized partitions.


We use exactly the same scheme over here: we take the hash ring, split it into equal-sized
partitions, and then map the partitions to the nodes. Now say we have a key, any member ID or
anything else, that we want to locate on the DHT. We do the same thing: we hash the key and map it
to a preference list.

The preference list will be a preference list of partitions. If the key is mapped to some point on
the DHT, we will have a set of partitions that form the preference list of the key. There will be
a primary owner, the main partition, which can simply be the immediate clockwise successor of the
key.


As we walk clockwise, we have N - 1 subsequent partitions whose main job is to store the replicas.
This is a very standard architecture that we have been following throughout this lecture series,
where we take a DHT and partition it. It is a very Dynamo-like fixed partitioning, where a key is
simply mapped to the partition that is its clockwise successor and to the N - 1 subsequent
partitions, subject to the condition that they are on separate physical nodes. We can also require
that they are in separate data centers for added reliability. This is the broad idea, and Dynamo
used exactly the same thing.

The storage layer is pluggable. We have the traditional get and put functions.
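The mapping from a key to its preference list can be sketched as follows. This is a simplified model and not Voldemort's code: the use of MD5, the modulo step that picks the starting partition, and the example partition-to-node table are all assumptions made for illustration.

```python
import hashlib


def preference_list(key, partition_to_node, n):
    """Return up to n partitions for `key`: the partition the key hashes
    to, followed by the next partitions (walking clockwise) that live
    on nodes not already chosen."""
    num_partitions = len(partition_to_node)
    digest = hashlib.md5(key.encode("utf-8")).digest()
    start = int.from_bytes(digest, "big") % num_partitions

    chosen, nodes_seen = [], set()
    for step in range(num_partitions):
        partition = (start + step) % num_partitions
        node = partition_to_node[partition]
        if node not in nodes_seen:        # replicas must be on distinct nodes
            chosen.append(partition)
            nodes_seen.add(node)
        if len(chosen) == n:
            break
    return chosen


# Eight equal-sized partitions spread over four hypothetical nodes.
partition_to_node = ["A", "B", "C", "A", "B", "C", "D", "D"]
plist = preference_list("member:42", partition_to_node, n=3)
print(plist, [partition_to_node[p] for p in plist])
```

The first entry is the primary owner and the remaining N - 1 entries hold the replicas. If the cluster has fewer than N distinct nodes, the list is simply shorter.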


We also have block read functions, in the sense that we support streaming reads of data. The main
aim here is to support primarily read-only operations, so we will see what modifications had to be
made to an otherwise vanilla Voldemort system. A system is called vanilla when you are using its
base version, but in this case, we are not.


We are modifying Voldemort to take in the data that Hadoop generates. A baseline Hadoop system
generates all the data, does all the data mining, and bulk loads the result into Voldemort, and
Voldemort has its DHT, which serves the data. Some administrative functions, like adding and
deleting nodes, also have to be supported within this setup.

There are two kinds of routing. So far we have been looking at one-hop routing. Here we make a
small departure from Amazon Dynamo: we support client-side routing. A client is a machine that
lies outside the DHT, and with client-side routing the client can directly send a message to the
servers that hold the partition for a specific key. Of course, in that case the client needs
access to all the routing tables.


What was our traditional routing algorithm? We contact a given server, which contacts a few more
servers, and the request jumps from server to server until it reaches the right one. In one-hop
routing, as in Dynamo, we contact one server, and it then does a single hop to reach the right
server.


But in client-side routing, the client has all the routing information. The traditional approach
is, of course, server-side routing, where the server makes all the routing decisions. This option
is also supported in Voldemort: both client-side and server-side routing are supported.

As far as what we have seen up till now, it is very similar to a traditional vanilla DHT system.
Client-side routing was new, but most of it is otherwise the same. So let us look at the storage
part. We will start with the shortcomings of MySQL and Berkeley DB.


Here we are talking about bulk loading of data, where a Hadoop cluster processes the LinkedIn data
all night. Once the data is in a nicely processed form, all of it needs to be loaded into
Voldemort. For this model, a long sequence of put requests is not the right solution. MySQL uses a
B+ tree, which would need to be updated many times, once per put, and that is rather inefficient.

An alternative solution might be to maintain a separate server that holds a copy of the database.
We first populate that copy, and then we switch to the new copy instantaneously, so we maintain 2
copies. This has several disadvantages as well. The first is that we are doubling the amount of
resources. There is also the overhead of the bulk copy and of creating and managing the indices.


Let us look at another alternative, which is to run the Hadoop system to generate the indices
offline. Since one of the problems was generating the indices, we generate them offline as well
and then load them in along with the data. This is fine and doable; it is perhaps the way that
many smaller systems actually operate.


But here we need a lot of additional resources, and Hadoop, along with processing the LinkedIn
data, has to do a lot of work on behalf of the relational database. As we also saw in the case of
Facebook's systems, we will not use a relational database management system as the primary source
of data. Instead, we will create a custom solution, which is anyway the job of a distributed
storage engine.

So what are the requirements? First, we will minimize the performance overhead on live requests:
any request that we get should be served well, regardless of whether a tranche of Hadoop data is
being loaded at the time. Second, we will of course have fault tolerance, which we always get
because of replicas. Third, we will make the system scale, in the sense that we will be able to
add more servers as and when required.

There is also the possibility of errors in whatever Hadoop computes. Hadoop is giving us a lot of
data, and that data might actually be invalid. There might be errors, and then we might have to
dump all of it. So we need a fast rollback capability to a previous version. That is because a
system as large as LinkedIn relies a lot on the relationships that it has computed.


If there is a bug in the code, or if there is a lot of superfluous data, we do not want to end up
in a bad state. What we would like to do instead is to roll back to a previous version, where the
data might be slightly stale but is known to be correct. Most of us do not update our contact
lists that frequently anyway, so slightly stale data will still work.


Our system should also be able to handle large datasets. So a Voldemort system should meet all of
these requirements. Scaling, fault tolerance and the ability to handle large datasets are things
that any distributed system should ideally provide, and we have seen them before. Two aspects are
new over here. The first is the fast rollback capability, which we have not seen before.


So, for example, we did not see a rollback in let us say Facebook photos for example, but in 
this case, the entire computed data is very sensitive. So, just in case, we feel that there is some 
error, we would like to roll back to a safe previous version. And also, we would like to minimize 
the performance overhead of live requests, because we are never really bulk loaded, read only 


data, which we are doing right here right now. 
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Let us now go through the exact sequence of steps that a typical LinkedIn load would follow.
In the first step, a driver program sends a message to Hadoop, which runs over HDFS (the
Hadoop file system), to trigger a build. So, the Hadoop plus HDFS systems would essentially
start the build.


And so, what does it mean? Well, what it means is that they will fetch all the raw data, all
the LinkedIn data that they have access to. The Hadoop system will take and process all of
this data, and from it, it will figure out the list of recommendations that you just saw.


It will figure out, let us say, that if I access a certain person’s profile, it should also
provide the names of other LinkedIn members whose profiles I could go to, and so on and so
forth. Once the entire data has been generated, the driver program sends a message to the
Voldemort cluster, asking it to trigger a fetch. The Voldemort cluster initiates a parallel
fetch from all the Hadoop nodes. So, we have a set of distributed Hadoop nodes and,
similarly, a Voldemort cluster, and the cluster triggers a parallel fetch.


So, we have a lot of parallel I/O happening at this time. Once all the data has been brought
into the Voldemort cluster, each of the nodes will have 2 versions, a new version and the old
version. The driver program will then send a message to the Voldemort cluster to trigger a
swap, and the Voldemort cluster would execute the swap atomically. What this means is that
each node has a pointer, which was previously pointing to the old version of the data; it
will immediately start pointing to the new version of the data, which is what we wanted.


So, it is an instantaneous swap. We load the data in the background, we do not have to lock
anything, there is no disruption in service, and the pointers instantaneously switch from the
old version to the new version.
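

The build, fetch and swap sequence above can be sketched in a few lines of Python. This is
only an illustrative simulation, not Voldemort's actual code: the names StoreNode, build and
run_load are hypothetical, the "Hadoop job" is a plain function, and a CRC32 hash stands in
for the real routing of keys to nodes.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


class StoreNode:
    """Hypothetical Voldemort node that keeps several versions of a read-only store."""

    def __init__(self, name):
        self.name = name
        self.versions = {0: {}}   # version number -> read-only key-value data
        self.current = 0          # the "pointer" that live reads follow

    def fetch(self, version, build_output):
        # Pull this node's share of the build; live reads still use self.current.
        self.versions[version] = build_output[self.name]

    def swap(self, version):
        # One assignment switches reads to another version (also used for rollback).
        self.current = version

    def get(self, key):
        return self.versions[self.current].get(key)


def build(raw_edges, node_names):
    """Stand-in for the Hadoop build: derive per-node key-value data from raw data."""
    output = {name: {} for name in node_names}
    for member, contact in raw_edges:
        owner = node_names[zlib.crc32(member.encode()) % len(node_names)]
        output[owner].setdefault(member, []).append(contact)
    return output


def run_load(nodes, raw_edges, version):
    """Driver program: trigger the build, then a parallel fetch, then the swap."""
    output = build(raw_edges, [n.name for n in nodes])
    with ThreadPoolExecutor() as pool:
        list(pool.map(lambda n: n.fetch(version, output), nodes))
    for n in nodes:
        n.swap(version)
```

Since the old version stays on the node after the swap, calling swap with the old version
number is all that a rollback needs in this sketch.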
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Now, the storage format. We did discuss on a previous slide a storage format that could be
used by a MySQL implementation. In this case, we are not discussing the MySQL implementation
of Voldemort but a custom implementation. Voldemort is Java based, and it uses the OS,
particularly the page cache of the OS, to manage its memory. It assumes that the OS is doing
a good job. The input data destined for a node is split into multiple chunk buckets, and each
chunk bucket is split into multiple chunk sets. So, what is a chunk bucket?


For a given partition (here, of course, the primary partition and not a secondary one), we
take all the data that maps to a given node, and we create a chunk bucket out of it. We will
have multiple such chunk buckets, one for each replica. So, the chunk bucket can be uniquely
identified, or uniquely named, with the ID of the partition and the ID of the replica that it
is storing.


So, within a chunk bucket, of course, we have chunk sets. Each chunk set stores some data,
and we will see what the data is. In this case, each chunk set is organized similar to
Facebook’s photo storage, where we have a data file and an index file.


Here, of course, the spirit of the design is that the index file will be in memory, and the
data file will be on disk. The naming convention that we follow over here begins with the
partition ID and the replica ID.
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A partition is essentially a range of the hash space: for any key that falls within this
range, we store the data for it. The replica ID tells us which replica this is.


So, the partition ID and the replica ID essentially point to the chunk bucket. Within the
chunk bucket, we will of course have chunk sets, so then we get the chunk set ID. The last
part of the name says whether this is the data file or the index file. An entry in the index
file has two fields.


The first field is derived from the key of the data that we are trying to access: we compute
the MD5 signature of the key. MD5 is another hashing algorithm, similar to SHA-1 that Chord
uses. We use the top 8 bytes of it.


Once we are able to find the index entry, we use its second field, a 32-bit offset into the
data file, similar to what Facebook’s photo storage used to do. So, what do we do? We search
through the index file till we find the entry that we are interested in, which will match the
top 8 bytes of the MD5 signature. From this entry, we get a 4-byte offset into the data file.
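

To make the layout concrete, here is a minimal Python sketch of a chunk set, assuming a
12-byte index entry (top 8 bytes of the MD5 of the key, then a 4-byte offset) and a data file
of length-prefixed values. The underscore-separated file name, the big-endian encoding and
the helper names are assumptions made for illustration, and collisions are ignored here.

```python
import hashlib
import struct

ENTRY = struct.Struct(">QI")  # 8-byte MD5 prefix + 4-byte offset into the data file


def chunk_set_file(partition_id, replica_id, chunk_set_id, kind):
    # kind is "data" or "index"; the exact separator is an assumption of this sketch.
    return f"{partition_id}_{replica_id}_{chunk_set_id}.{kind}"


def key_prefix(key: bytes) -> int:
    # Top 8 bytes of the MD5 signature of the key, as an integer.
    return int.from_bytes(hashlib.md5(key).digest()[:8], "big")


def build_chunk_set(pairs):
    """Return (index_bytes, data_bytes) for a dict of key -> value (both bytes)."""
    data = bytearray()
    entries = []
    for key, value in pairs.items():
        entries.append((key_prefix(key), len(data)))
        data += struct.pack(">I", len(value)) + value   # length-prefixed value
    entries.sort()                                       # index is sorted by MD5 prefix
    index = b"".join(ENTRY.pack(h, off) for h, off in entries)
    return index, bytes(data)


def lookup(index, data, key):
    """Binary search the index for the MD5 prefix, then follow the offset."""
    target = key_prefix(key)
    n = len(index) // ENTRY.size
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi) // 2
        h, _ = ENTRY.unpack_from(index, mid * ENTRY.size)
        if h < target:
            lo = mid + 1
        else:
            hi = mid
    if lo == n:
        return None
    h, offset = ENTRY.unpack_from(index, lo * ENTRY.size)
    if h != target:
        return None
    (length,) = struct.unpack_from(">I", data, offset)
    return data[offset + 4: offset + 4 + length]
```

The index stays small (12 bytes per key in this sketch), which is what makes it reasonable
to keep it in memory while the data file stays on disk.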


Now, up to this point, I have kept the discussion reasonably independent of LinkedIn. This is
because Voldemort is a generic system, and so, till now, I have not introduced any
LinkedIn-specific things. That is the reason why I have only spoken of the 8-byte MD5
signature.


And, of course, this is the MD5 signature of the key. In all the DHTs that we have seen up
till now, we were taking the key and computing its SHA-1 hash. Here, instead of SHA-1, we use
another hashing algorithm, which does the same thing.


So, what does a DHT do? Given a key, it gives us the value. How do we do that? Well, we
compute the hash of the key and we find the node that stores it. How did we compute the hash?
Previously, we used the SHA-1 signature. In this case, instead of SHA-1, we just use MD5,
which does the same thing; MD5 has some better properties with regard to uniformity, and that
is the reason it is used. And of course, we use only the top 8 bytes of it, not all the
bytes. We then locate this in the index file inside a given chunk set, and that points us
into the data file. So, the main aim over here is that given a key, we would like to go to
the right partition.
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That is the first thing: we essentially route ourselves to the node that contains the
partition. Within the partition, for a given replica ID, we will try to find the chunk set
ID. For that chunk set, again, we will have an index file and a data file. Here we will use
not the SHA-1 signature but the MD5 signature to locate ourselves in the index file, and then
there will be a pointer into the data file.
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It is possible to have hash collisions, where different keys share the same top 8 bytes of
the MD5 signature. So, we will have a number of collided tuples. The data file additionally
stores a list of collided tuples, along with the number of key-value pairs that have collided
with a given key. The list of collided tuples will be stored at the end of the data file.
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So, how do we generate chunk sets? The input will be the number of chunk sets per bucket,
the cluster topology, the store definitions, and where the input data is in HDFS. We will
have a mapper that emits the upper 8 bytes of the MD5 of the key along with some more data:
the node ID, the partition ID, the replica ID, the key and the value.


So, the job of the mapper is as follows. When the Hadoop system is generating data, it will
generate data as key-value pairs. We will discuss the exact semantics of what the key and the
value are in the context of LinkedIn in later slides. But for the time being, let us assume
that it is a key-value pair.


The job of the mapper is to take a key and a value, compute the MD5 of the key, and map it
to a given node and partition, essentially to a set of replicas. So, we first map it to a
node. Then, from the node, we find the partition, and then we find the set of replicas, so
that the key-value pair is stored on those replica servers.


So, what does the partitioner do? Well, it routes data to the correct reducer, based on the
chunk set ID. What is a reducer? A reducer further processes the data; recall that we are
doing a MapReduce kind of computation. So, for every chunk set ID, we first extract all the
relevant raw data from HDFS; then the reducer processes all of that data and stores it for
the node ID that we computed earlier. Every reducer, let us call the reducer a dedicated
process, is responsible for only one chunk set.
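

The mapper, partitioner and reducer roles can be sketched as below. This is a toy,
single-process stand-in for the Hadoop job, and several details are assumptions made only
for illustration: the constants, the partition-to-node table, the placement of replica r on
the node that owns the r-th next partition, and deriving the chunk set ID from the leading
bytes of the MD5 digest modulo the number of chunk sets.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 8
REPLICATION = 2
CHUNK_SETS_PER_BUCKET = 2
# Hypothetical cluster topology: 8 partitions spread over 4 nodes.
PARTITION_TO_NODE = {p: p % 4 for p in range(NUM_PARTITIONS)}


def mapper(key: bytes, value: bytes):
    """Emit one record per replica, tagged with where it has to end up."""
    digest = hashlib.md5(key).digest()
    primary = int.from_bytes(digest, "big") % NUM_PARTITIONS
    chunk_set = int.from_bytes(digest[:4], "big") % CHUNK_SETS_PER_BUCKET
    for replica_id in range(REPLICATION):
        node = PARTITION_TO_NODE[(primary + replica_id) % NUM_PARTITIONS]
        route = (node, primary, replica_id, chunk_set)
        yield route, (digest[:8], key, value)


def run_job(pairs):
    """Route records to one 'reducer' per chunk set and sort each reducer's input."""
    reducers = defaultdict(list)
    for key, value in pairs:
        for route, record in mapper(key, value):
            reducers[route].append(record)      # partitioner: one reducer per chunk set
    # Each reducer sees its records sorted by MD5 prefix, as the Hadoop sort would give.
    return {route: sorted(records) for route, records in reducers.items()}
```

Each entry of the returned dictionary corresponds to one chunk set, that is, to one pair of
index and data files on one node.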
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So, what Hadoop would do is sort all the inputs based on the keys. Each Voldemort node is a
directory in the HDFS file system, and we have a separate file for every chunk set. So, for
every chunk set, we generate a separate file in Hadoop, and that file contains all the
key-value pairs that we would be interested in.
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So, a given store is represented by a directory, as we just saw: every Voldemort node,
which is essentially a store, is represented by a directory. This allows us to store versions
very easily, in the sense that every version of a store will have a unique directory.
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So, we can say that, for a given store, this is version 1 and this is version 2. The
current version of a store would point to the right directory using a symbolic link, which is
very easy. Let us say that for a given store S, there is just a symbolic link in the file
system. It was previously pointing to version 1; we will make it stop pointing to version 1
and start pointing to version 2.


So, for moving to a new version, we will get a read-write lock on the previous version’s
directory, close all the files, open the new files in the new directory, map them to memory,
and then switch the symbolic link. It is important that when we switch the symbolic link to
the new directory, all the files are already mapped to memory.
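

The symbolic-link part of this switch can be sketched as follows. This is a minimal
illustration on a POSIX file system, not Voldemort's code: the directory names version-N and
current are hypothetical, and the locking and memory-mapping steps mentioned above are left
out. The link is created under a temporary name and then renamed over the old one, because a
rename replaces the old link in a single atomic step.

```python
import os
import tempfile


def publish(store_dir, version):
    """Atomically point store_dir/current at store_dir/version-<version>."""
    target = f"version-{version}"                 # relative link target
    tmp_link = os.path.join(store_dir, "current.tmp")
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    # rename() over an existing symlink replaces it atomically on POSIX systems.
    os.replace(tmp_link, os.path.join(store_dir, "current"))


def current_version(store_dir):
    return os.readlink(os.path.join(store_dir, "current"))


if __name__ == "__main__":
    store = tempfile.mkdtemp()
    for v in (1, 2):
        os.mkdir(os.path.join(store, f"version-{v}"))
    publish(store, 1)
    publish(store, 2)          # move to the new version
    publish(store, 1)          # rollback is the same operation
    print(current_version(store))
```

Since every version keeps its own directory, a rollback is simply another call to publish
with the older version number.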
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Data loading: how does it work? Well, the driver program that we have triggers a Hadoop
job to create a new version. What does this mean? Well, creating the new version essentially
means creating the different chunk sets for each chunk bucket. For every chunk bucket, we
store its data in a directory, and in the directory there are multiple files, where each file
is a chunk set.


Once we have done that, the Voldemort nodes pull these directories into their local file
systems and then atomically do a swap, as we mentioned. In the version of LinkedIn that
corresponded to this paper, the swap was a very fast operation: it took around 0.05
milliseconds, which is 50 microseconds, and it happened atomically.
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Now, a little bit about search and load balancing. The retrieval goes like this. We
calculate the MD5 hash of the key, and from that, we generate the ID of the primary
partition. If we are going to a replica, then we can find the ID of the replica. We then find
the ID of the chunk set, which is set to be the first 4 bits of the MD5 of the key. From
that, we access the machine on the DHT and we find the chunk set index file.


From the chunk set index file, we locate the values within the chunk set data file. So, how
do we do that? In an index file, all the keys are stored in sorted order based on the bits of
the MD5 hash. So, one option is that we do a binary search, which will take O(log N) time.
This might further involve multiple disk reads if a part of the index is stored on disk. What
we can instead do is perform an interpolation search, which means that we take a look at the
values of the keys.


Since they are in sorted order, we make an approximate guess that the key lies in a certain
range. Once we go to that range, we again make an approximate guess and keep on narrowing the
range until we find the key. Interpolation search is faster than binary search: it takes
O(log(log(N))) time. But of course, this bound holds only in an expected sense; it relies on
an expected distribution of the keys, namely that they are spread roughly uniformly, which is
what we expect of hash values.
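

A minimal Python sketch of interpolation search over a sorted list of integer keys (for
example, the 8-byte MD5 prefixes read as integers) is shown below. The function name and the
in-memory list are assumptions for illustration; in the real system the search runs over the
index file.

```python
def interpolation_search(keys, target):
    """Return the position of target in the sorted list keys, or -1 if it is absent."""
    lo, hi = 0, len(keys) - 1
    while lo <= hi and keys[lo] <= target <= keys[hi]:
        if keys[hi] == keys[lo]:
            # All remaining keys are equal; target lies between them, so it matches.
            return lo
        # Guess the position by assuming the keys are spread evenly in [keys[lo], keys[hi]].
        pos = lo + (target - keys[lo]) * (hi - lo) // (keys[hi] - keys[lo])
        if keys[pos] == target:
            return pos
        if keys[pos] < target:
            lo = pos + 1
        else:
            hi = pos - 1
    return -1
```

With evenly spread keys, each guess lands close to the target, which is where the expected
O(log(log(N))) behaviour comes from; with badly skewed keys the same loop can degrade to a
linear scan.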
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Now, for rebalancing and schema upgrades. We might want to change the schema of the chunks
or the chunk sets. To do that, we will use version bits with every JSON file. JSON is a text
format similar to XML; it is the text format in which data gets transferred between Hadoop
and Voldemort.


So, every JSON file will have some version bits in its metadata to indicate the specific
schema that it is using. Also, for rebalancing, to handle any additional pressure on the
nodes, we can dynamically add partitions, and we can create a plan for moving partitions and
their replicas to new nodes.


We can also start moving the partitions and lazily propagate information about the temporary
topologies. All of this supports the elasticity of the server farm, in the sense that as we
keep on adding more servers, our system gradually keeps on growing.
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So, the experimental setup. We had a simulated data set, where the keys were random
1024-byte strings. In an actual LinkedIn system, it may not be this way; the key can actually
be my member ID. So, my member ID, Srsarangi, can be my LinkedIn key. If Voldemort would like
to fetch all the recommendations, the value would be all the people that I know, all the
people that I should know, all the LinkedIn recommendations. So, the member ID can be the
key, and the recommendations can be the value.


Alternatively, let us say you visit my profile. Then the key again will be my user ID, and
the list of people who have also visited my profile, which you see on the side, could be the
value. But for the sake of this experiment, the keys were just random 1024-byte strings.


A baseline Linux system was used with 8 cores and 24 GB of RAM. The number of nodes was set
to 25, with roughly 940 gigabytes of data per node, 123 stores, and a replication factor of
2, in the sense that each data item was replicated on 2 servers. The size of the store data
had a very wide variance, from 700 kilobytes to 4.15 terabytes. The maximum number of store
swaps per day was 76.
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So, what was observed? The build time was measured as the file size was increased from
1 GB to 1700 GB. If, let us say, I were to use a MySQL database, the build time for MySQL
would increase linearly from 0 to 350 minutes for 125 gigabytes. Hence, the overhead of
storing something in MySQL, in terms of calculating the index structures and so on, is so
high that MySQL becomes an order of magnitude slower than Voldemort.


But for Voldemort, even with 1700 gigabytes, the entire data can be generated and stored in
Voldemort’s format within 40 minutes. Of course, when we are comparing MySQL and Voldemort,
the inherent build time is the same. But bringing the data into MySQL’s format, which means
creating the indexing data structures, takes a very long time. In Voldemort, we do not need
that much time, because essentially we are partitioning all of this data.


So, let us look at it in a different way. If we are creating a billion key-value pairs and
storing them in MySQL, MySQL will have to create a very large indexing structure, which takes
time. In this case, we are partitioning the keys based on the hash of the key; we are doing a
hash space partition.


We essentially treat every partition as a chunk set, and we create a small file for it
within the store. This is much faster. So, Voldemort’s internal system is much faster. Next,
consider the read latency versus the time since the swap. For MySQL, the median read latency
reduces from 30 milliseconds to 2 milliseconds, mainly because most of this data comes into
the cache.
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For both interpolation search and binary search, these values for Voldemort reduced from
2 milliseconds to 0.1 milliseconds. So, Voldemort is still around 20X faster, which is
expected, given that it has small index structures for small data files instead of one large
index structure for the entire database.
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Next, the queries per second (qps) were compared. For the same 100 gigabyte data set, the
throughput was varied and the latency was measured. For MySQL, the latency increased with the
throughput: for 100 qps, it was 1.7 milliseconds, and for 420 qps, it was roughly 3.3
milliseconds.


In comparison, Voldemort’s latency increased far more gradually. The authors were able to
scale it from 1.2 milliseconds at 100 qps to 3.5 milliseconds at 700 qps. If we compare these
two sets of numbers, we see that for roughly the same latency, Voldemort can support close to
twice the throughput (700 qps versus 420 qps).
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The previous experiment was on randomly generated keys and data. The next one was done
with LinkedIn-specific data sets. The first is People You May Know (PYMK), which is something
that you saw when I showed you the demo: a suggested set of users that you may like to
establish a connection with. The other is collaborative filtering (CF), which shows profiles
similar to the visited member’s profile.


The latency was plotted after a swap. In both cases, the latency decreased sublinearly. CF
has a larger latency than PYMK because of the larger size of the value, but of course, this
depends on how many contacts are shown and on the LinkedIn version. Nowadays, of course, the
People You May Know list has grown very long. So, this is kind of an idiosyncrasy of the
specific LinkedIn version. But the important point is that both of these operations were
really fast.
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So, this brings us to the end of this lecture. This paper was published in 2012, and it
has been 8 years since then. In these 8 years, the number of LinkedIn users has clearly
increased by leaps and bounds. So, we need to wait for more information in terms of research
papers to understand how exactly the increase in scale has been handled.
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In this lecture we will discuss the Condor distributed job processing system. So, we will first 
go through an overview of what a distributed job management or a distributed batch processing 


system is and then we will discuss the main modules of Condor and the detailed operation. 
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So, towards the mid-80s, the power of distributed computing was realized. It was realized
that a single machine, regardless of its size, be it a small machine like a desktop computer,
a large machine like a mainframe, or even a supercomputer, is not sufficient for most
problems. Even if it is sufficient for a very restricted set of problems, it is not very
flexible.


So, the idea was to create a cluster of machines, which could be normal regular machines in
a lab, and aggregate their computing power such that they could outperform supercomputers.
What led to this is that by the late 80s and early 90s, many households, particularly in the
US and Europe, started to have desktop computers. These computers were not used at nighttime.


So, the idea was: why not take over these computing resources when they are not being used,
and have a large distributed system comprising all of these small desktop machines that
individual users had? We would essentially have a cluster of desktops that people had in
their homes, take a large piece of work, divide it into small chunks, and distribute these
small chunks of work across these machines.


So, this did sound like a very interesting idea. The reason is that there was a huge amount
of compute capacity that was not being utilized. This could be utilized for problems that did
not require the kind of synchronization that a typical job on a supercomputer would require,
such as inverting a matrix or solving a large differential equation.
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For such jobs, supercomputers are better. But consider other jobs where, for example, we
need to run a large simulation with a large number of parameters.


Let us say that we want to run just 100 copies of the same simulation with different
parameters; we can distribute one copy to each desktop. So, there was a need for a
middleware. A middleware is kind of like a runtime system over the operating system, and here
its job is to integrate all of these third-party computers.


So, the idea was to integrate computers with different types of hardware and software, and
provide consistency and reliability guarantees. Why consistency? Because we need to have some
idea of how these jobs execute. They may not be fully serializable, but at least there should
be some rules to the game.


Second, we need reliability, in the sense that we should be able to trust the results that
we get in terms of their correctness. There is also the security and trust aspect: if a job
is running on a remote desktop, it should not be possible for the job to hack into the
desktop, and it should not be possible for the desktop to tamper with the job.


So, these are third-party jobs. Let us say that today I put my desktop on this desktop
cloud. This kind of thing also has a new name, desktop as a service, but let us say it is a
compute cloud. If I put my desktop on a compute cloud, then the idea is that jobs can run on
my desktop. But the thing is that I will not know which job is running on my desktop.


So, it is possible that the job might try to corrupt my system. The converse is also
possible, where I have malicious intent and I am trying to corrupt that job. In both cases,
security and trust are required. Additionally, if we have a large distributed system of
machines, we need some notion of fairness; otherwise, we will not be able to guarantee the
timely completion of these jobs. Finally, it should be easy to run such large-scale
distributed jobs efficiently.


So, the first such system to actually propose this was the Condor project, which was born at
the University of Wisconsin and grew over time. Now, Condor is an open-source project that
anybody can download and run. As long as there is a cluster of machines, we can install
Condor on all of them, and to any outside user, the cluster would appear to be a single
machine.
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So, what are the advantages of Condor? Well, the main advantage is flexibility, because
we can take a lot of the desktop power that we are getting and integrate all of it into one
single large cluster machine. Much of this also started with the SETI project, a project to
find out if extra-terrestrial life exists. For this, a lot of information that was captured
by telescopes had to be analyzed to find traces of life outside Earth.


The computation capacity that the SETI project needed was not available formally. That is
the reason these desktops were taken over: people contributed compute power to a cloud, and
the cloud then ran jobs on them. So, what do users get in return? Well, in such systems, they
may get some money. These need not be fully philanthropic, altruistic activities. If I am
contributing compute power, I get some money in return, and that allows me to invest in more
compute power and contribute more.


So, in this case, communities grow naturally, and they build a bond of trust with other
communities. As a result, we have a large cluster of machines with a large amount of compute
power, which can be used to support the needs of a very large community. Each job can be very
different, so the jobs will be heterogeneous, but on this large set of machines, it would be
possible to run them.


So, Condor has some basic principles. The first is that we leave the owner of the computing
resource in control; otherwise, owners would not come forward to donate their machines to the
Condor cloud. So, it should be possible for an owner, for example, to terminate the remote
job and do work of his own.
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The system should also be fault tolerant, which means that if one machine crashes, another
machine can take over.


Also, the design of Condor lends and borrows concepts. Many concepts have been borrowed from
other disciplines, particularly parallel computing and distributed computing, to create the
system, and also from other areas like the Semantic Web, as we will see.
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So, Condor provides a method for a set of users to submit their jobs. This is done in
batch mode: these are not interactive jobs but batch jobs, which means that they run, and
whenever they are done, the user gets an email saying that the job is done. Condor provides
job management mechanisms, which means that it provides methods to manage a running job.


It provides scheduling policies, which determine how a job should actually be run; resource
monitoring, which tracks how much the resources are being used; and finally, resource
management. Condor is not really one large pool; it is actually a collection of several small
pools of machines.


So, we can have one small pool of, let us say, the IIT Delhi machines, and another pool of
other machines. We can aggregate all of these and make one large virtual pool of machines. In
this context, there are two important concepts that I would like to mention: one is called
planning and the other is scheduling.
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Planning is when we take a large job, let us say one that requires 200 machines, and we
decide that we will take 100 machines from one pool, 50 machines from another, and 50
machines from a third. This is broadly planning.


Scheduling is the next phase, where each of these individual clusters decides when and how
to run these jobs, because it will have local jobs as well. It decides when and how to run
these jobs such that a given metric is maximized or minimized.
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So, this is the broad idea: we have a large cloud of compute nodes called the Condor
cloud. Individual users submit jobs from their machines, the cloud executes the jobs, and
finally, the results are given back to the individual machines.
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So, now, we will discuss the main modules of a Condor system. First, we have the ClassAds
system. This is a language that lets users specify the type of the job. This can include the
type of machine that is required, the operating system, and the kind of software environment
that is required. It can also describe the type of resource that is offered by the cloud.


So, this is used to specify both the job and the resource, as well as any kind of matching
policy, because after all, jobs have to be matched with resources. Then we have an execution
engine, which executes the user’s jobs and respects the constraints of a job. A job may not
be a single program; it can be a set of programs, where the output of one program feeds into
the input of another program.
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So, in this case, the execution engine respects all of that, and it runs the jobs on a
large grid of machines. Typically, the constraints are represented using a directed acyclic
graph, a DAG. Furthermore, a great feature of Condor is that it allows job checkpointing and
migration. What this basically means is the following. Let us say that a job is running on a
machine. In the middle of the job’s execution, the user presses a key, which means that the
owner of the desktop is now active.


In this case, we checkpoint the state of the job, and we migrate the job to another machine,
where it starts at exactly the same point in its execution and continues to execute. So,
checkpointing and migration is a key feature of Condor, which allows even running jobs to
seamlessly migrate between machines. Furthermore, we will define something called a remote
sandbox, which dictates how jobs run on remote machines. Let us defer this discussion to a
later slide.
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So, the structure of the Condor system, the Condor kernel, is like this. We have the user,
and then we have the problem solver. The problem solver is essentially a software module that
takes the structure of the jobs, which is represented as a directed acyclic graph. Each job
is given to a Condor agent. The agent registers itself with a matchmaker, which tries to pair
it with a resource.


So, the role of the matchmaker here is to pair a job with a resource. Once a job gets the
kind of resource that it requires from the Condor system with the help of the matchmaker, the
agent starts a process called the shadow.
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The job of the shadow is to help the sandbox on the resource run the job. We will discuss
later in detail how the shadow and the sandbox talk to each other and how they collaborate.


And then the job runs on the sandbox on the resource. So, we can think of the sandbox as a 
secure software environment that runs the job. So, that is the main aim, the Sandbox is a secure 
software environment to run the job. Well, the job cannot maliciously access the resources, the 
access features of the resource, and the resource cannot harm the job. So, it is a two-way 


sandbox. 


So, the flow of actions in Condor is like this. The user submits a job to DAGMan, the DAG
manager. It parses the DAG structure of jobs and sends the jobs one after the other to an agent.
The agent stores the jobs in persistent storage and finds resources to run them. Agents and
resources periodically send messages to a dedicated matchmaker, whose job is essentially to
pair agents with resources.


Once the matchmaker reports a match, the agent checks with the resource whether it is still
available, because it is possible that the resource is no longer available because the local user
is using it. Then the agent spawns a process called a shadow to handle the execution of the
job. The resource creates a sandbox to run the job.


So, we will now discuss Condor pools. Agents and resources can get together and form a Condor
pool. So, a Condor pool is a pool of agents and resources. Every pool has a matchmaker. A
resource can enforce some policies regarding what kind of resources it offers, and the matchmaker
can have additional policies. The default is, of course, that we get a machine in the same pool.
But in the mid-90s, Condor started expanding this to get machines from remote pools as well.


So, there were two approaches to get resources from remote pools: gateway flocking and direct
flocking. In gateway flocking, every pool has a gateway that can interact with the gateways of
other remote pools.


Let us say that some pool has an idle machine. Then its gateway can let other gateways know
about the idle machine. So, if there is a request from within our pool, it can be forwarded to
the other pool that has the idle machine, and the idle machine can be assigned to one of the
jobs in our pool. This way, we can forward information among pools.


Direct flocking is when an agent reports itself to multiple matchmakers. It does not have to go
through a gateway. It can directly report itself to multiple matchmakers in different pools,
which means it effectively joins multiple pools and gets resources from them.


So, direct flocking and gateway flocking were the mainstay of Condor for quite some time. Then,
in the late nineties, the Globus toolkit for managing grids of computers emerged. It was a
standard architecture to interconnect clusters and grids. It provided trust, security, and secure
file access and transfer services using the GRAM protocol, the Grid Resource Access and
Management protocol.


So, along with direct flocking, where an agent registers with multiple matchmakers in different
pools, and gateway flocking, where each pool has its gateway (the agent contacts its local
gateway, and the local gateway contacts remote gateways), an extension to Condor was added
where the underlying architecture was actually the Globus cloud architecture.


The Globus cloud virtualized the entire cloud system for Condor, such that Condor saw the
entire cloud as one large cluster of machines, and the notion of pools went away, because the
entire cloud became a single pool.


So, with these three flocking approaches in mind, let us now come to matchmaking in Condor.
Both agents and resources advertise their details using small snippets of text called ClassAds.
The main aim is to pair agents and resources based on their ClassAds.


Once there is a match, the agent goes and claims the resource. It is of course possible that in
the meantime the resource has died or become busy. So, availability needs to be checked, and
if the resource is not available, then the matchmaker needs to be contacted once again.


So, a typical ClassAd would look something like this. MyType is the type of the ad; here it is
a job, but it could be a resource as well. The target type is a machine. The requirements on the
machine are like this: the operating system should be Linux, let us say. Then there is the rank,
which here depends on the amount of memory; we will discuss the rank in some detail. The rank
is pretty much a function that is used to evaluate the suitability of a resource.


Of course, the rank can be defined in different ways. The way that this ClassAd defines the
rank of a machine for a job is memory times 10,000 plus the KFlops, where KFlops is the number
of floating-point operations per second, in thousands. So, this essentially specifies a function
of how a job would assess a resource. The requirements specify what kind of a machine we need
to run on.


If there are multiple machines that match the requirements, then we use the rank function to
find the suitability of the machines, and we choose the machine that is the most suitable. The
command is what will be executed on the remote machine; in this case, it is abc.exe, and the
directory is abc. And the owner is the owner of the job.


The owner of the job is myself; this is essentially my user ID. So, as mentioned, the requirements
indicate the constraints. The rank is like an objective function that we can use to assess how
good the match is. Among the available resources, the matchmaker chooses the resource with the
highest rank.
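

The matching step just described can be sketched in a few lines. This is only an illustration:
real ClassAds are written in Condor's own ClassAd language and matching is two-sided (the
resource has requirements too). Here the ads are plain Python dictionaries, the requirements and
rank are ordinary functions, and the machine names and numbers are invented for the example.

```python
# A job ad in the spirit of the one on the slide (values are illustrative).
job_ad = {
    "MyType": "Job",
    "TargetType": "Machine",
    "Requirements": lambda m: m["OpSys"] == "LINUX",
    "Rank": lambda m: m["Memory"] * 10000 + m["KFlops"],
    "Cmd": "abc.exe",
    "Owner": "myself",
}

# Hypothetical machine ads.
machine_ads = [
    {"Name": "m1", "MyType": "Machine", "OpSys": "LINUX", "Memory": 64, "KFlops": 21000},
    {"Name": "m2", "MyType": "Machine", "OpSys": "WINDOWS", "Memory": 512, "KFlops": 50000},
    {"Name": "m3", "MyType": "Machine", "OpSys": "LINUX", "Memory": 128, "KFlops": 15000},
]

def match(job, resources):
    """Filter resources by the job's requirements, then pick the highest rank."""
    candidates = [
        r for r in resources
        if r["MyType"] == job["TargetType"] and job["Requirements"](r)
    ]
    # max() with default=None returns None when nothing satisfies the requirements.
    return max(candidates, key=job["Rank"], default=None)

best = match(job_ad, machine_ads)
```

In this sketch m2 is filtered out by the requirements even though it has the most memory, and
m3 wins over m1 because the rank function weights memory heavily.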


So, further enhancements were made to matchmaking. Support was added for writing custom Java
and C modules for even more sophisticated matchmaking. Also, gang matching was allowed. Many
a time, we might want to run a program like MATLAB, which comes with a license, on a given
machine. In this case, we would like to co-allocate the MATLAB license and the machine.


This kind of co-allocation of more than one resource is known as gang matching. Support for
this was added because many times in a cluster, we like to run jobs that use software with
certain licensing requirements, or certain licenses are available on some machines and not on
others. Gang matching allows for that.
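

As a rough sketch of what co-allocation means, the function below hands out a machine and a
license together or not at all. The data layout (`idle`, `free`, `valid_on`) and the function
name `gang_match` are assumptions made for this illustration and are not Condor's actual
interface.

```python
def gang_match(machines, licenses):
    """Return a (machine name, license id) pair that can be allocated together.

    A pair qualifies only if the machine is idle, the license is free, and
    the license is valid on that machine. Returns None if no pair exists.
    """
    for machine in machines:
        if not machine["idle"]:
            continue
        for lic in licenses:
            if lic["free"] and machine["name"] in lic["valid_on"]:
                return machine["name"], lic["id"]
    return None

machines = [
    {"name": "m1", "idle": False},
    {"name": "m2", "idle": True},
    {"name": "m3", "idle": True},
]
licenses = [{"id": "matlab-1", "free": True, "valid_on": {"m1", "m3"}}]

allocation = gang_match(machines, licenses)
```

Here m2 is idle but has no valid license, and m1 has a valid license but is busy, so only m3
can be matched together with the license.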


Then support for collections was added, which provides support for saving ClassAds such that
even if the matchmaker crashes, we can retrieve the ClassAds and do the matching once the
matchmaker comes up. Then there was set matching, where instead of matching a single ClassAd,
we can match a set of ClassAds. Also, named references were added, where one ClassAd can refer
to another ClassAd, which means that the information in the referenced ClassAd is automatically
used by the one that refers to it.


So, how far have we gone? We have discussed the broad philosophy of Condor and we have
discussed Condor pools. Within that, we have discussed direct flocking, gateway flocking, and
running Condor on a grid with its own middleware, which was essentially running Condor on the
Globus toolkit. Now, we will discuss the problem solver, which is essentially the core execution
part of Condor.


So, Condor has a master-worker mode, where there is one master process that directs the work
of many worker processes. The master process has a work list; think of this as a work queue. It
maintains a record of all outstanding work, all the jobs that need to be performed. Then, we
have a tracking module, which keeps track of all the remote processes and allots them work
items.


So, in the master-worker model of Condor, there is essentially one Condor master server and
many worker servers. This is sometimes called the master-slave model, but I will use the term
master-worker model, which is what people typically use. The master maintains a work list, a
work queue, of what work needs to be done.


Items in this work list are assigned to different workers, and a tracking module tracks how far
the workers have progressed. A steering module examines the results generated by the workers,
modifies the work list, and coordinates with Condor. The idea is that workers can of course die
or crash at any time; the tracking module then returns the undone work to the work list.
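

A minimal sketch of the work list and the tracking behaviour described above is given below.
The class and method names (`Master`, `assign`, `worker_crashed`, `worker_done`) are invented
for this illustration; they are not the names used by Condor's master-worker library.

```python
from collections import deque

class Master:
    """Toy master: keeps a work list and tracks which worker holds which item."""

    def __init__(self, items):
        self.work_list = deque(items)   # outstanding work
        self.in_progress = {}           # worker id -> work item
        self.results = {}               # work item -> result

    def assign(self, worker):
        """Hand the next work item to a worker, or None if the list is empty."""
        if not self.work_list:
            return None
        item = self.work_list.popleft()
        self.in_progress[worker] = item
        return item

    def worker_crashed(self, worker):
        """Tracking step: return the undone item of a dead worker to the work list."""
        item = self.in_progress.pop(worker, None)
        if item is not None:
            self.work_list.append(item)

    def worker_done(self, worker, result):
        """Record the result of a finished work item."""
        item = self.in_progress.pop(worker)
        self.results[item] = result

master = Master(["a", "b", "c"])
master.assign("w1")            # w1 gets "a"
master.assign("w2")            # w2 gets "b"
master.worker_crashed("w1")    # "a" goes back to the end of the work list
master.worker_done("w2", "B")  # result of "b" is recorded
pending = list(master.work_list)
```

Re-queuing the item is only safe when the work item is free of side effects, because the item
may have partially executed before the worker died.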


That work can then be assigned to another worker. Of course, in such cases, we assume that the
job is side-effect free, which means that it does not access other resources. For example,
suppose a job has to send a message on the network, and we are not allowed to send two copies
of the message. Now, assume that the job dies midway. If it is restarted, another worker will
again send the same set of network messages, which will cause a problem.


Hence, to ensure that no third parties are affected, we typically prefer side-effect-free jobs,
even though Condor can execute jobs with side effects. In that case, a higher application layer
has to take care of the problems. The tracking module can also replicate work items, again for
side-effect-free jobs, if, let us say, a given work item is crucial.


For example, it is possible that the DAG of jobs is such that job J1 needs to be done first, and
only then can we start J2, J3, J4 and so on. Given the criticality of J1, it might be a good idea
to run multiple copies of J1 and use the results of the copy that completes first. This will
speed up the entire system, because J1 is on the critical path.


So, the jobs are specified as a DAG, a directed acyclic graph: you run J1 first, then J2, then
J3. And then you might have that after J3 you run J4, but for J4 to run you have to finish both
J2 and J3 first.
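

The example DAG can be written down and ordered with Python's standard `graphlib` module
(available from Python 3.9). This is only a sketch of the ordering constraint; it is not the
DAGMan input format, and the dictionary layout is an assumption of this example.

```python
from graphlib import TopologicalSorter

# Each job maps to the set of jobs that must finish before it can start.
dag = {
    "J1": set(),
    "J2": {"J1"},
    "J3": {"J2"},
    "J4": {"J2", "J3"},
}

# static_order() yields the jobs in an order that respects every dependency.
order = list(TopologicalSorter(dag).static_order())
```

For this particular DAG there is only one valid order, J1, J2, J3, J4, because the jobs form a
chain. In a wider DAG, jobs with no dependency between them could run in parallel.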


So, in this way, you can specify a directed acyclic graph of jobs, and Condor will respect those
constraints. Furthermore, pre- and post-processing of jobs is supported, which means that for a
given job, I can run a small program before the job and a small program after the job. What do
they do? The first program essentially ensures that the inputs for the job are in order.


The second program, for post-processing, ensures that the outputs are in order. It also verifies
whether the job executed correctly or not. If it has not executed correctly, then the job needs
to be put back in the Condor queue so that it executes once again. Now, assume that a given job
fails: J1 works correctly, J2 works correctly, but J3 does not run. Because J3 does not run, J4
also does not start. It is possible for Condor to print a rescue DAG, which in this case would
contain only J3 and J4.


Given that the jobs J1 and J2 have executed, if there is some problem in J3, the user can fix
it and just reissue J3 and J4, and that would complete the entire execution. So, it is possible
to have a retry command where these two jobs are executed once again. In this case, we do not
have to run the entire DAG; we only have to run a part of it, namely the jobs J3 and J4. A
Condor-like system allows us to do that as long as the jobs are side-effect free, which allows
us to run the jobs as many times as required.
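

The rescue DAG for this example can be computed by dropping the completed jobs and the
dependencies on them. The helper `rescue_dag` below is a hypothetical function written for this
illustration; the file format that Condor actually prints is not shown here.

```python
def rescue_dag(dag, completed):
    """Return the part of the DAG that still has to run.

    dag maps each job to the set of jobs it depends on; completed is the
    set of jobs that already finished successfully.
    """
    return {
        job: deps - completed          # dependencies that are already satisfied vanish
        for job, deps in dag.items()
        if job not in completed
    }

dag = {
    "J1": set(),
    "J2": {"J1"},
    "J3": {"J2"},
    "J4": {"J2", "J3"},
}

# J1 and J2 succeeded, J3 failed, so J4 never started.
rescue = rescue_dag(dag, completed={"J1", "J2"})
```

The result contains only J3 (now with no pending dependencies) and J4 (which still waits for
J3), matching the rescue DAG described above.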


So, now a little bit of detail on how exactly the process works. We will now discuss the shadow
and the sandbox. Recall that the agent creates a shadow process and the resource creates a
sandbox. So, what is the shadow useful for? Well, when a job is specified, it might refer to a
host of things: input files, network connections, database connections, and anything else that
is a function of the environment, including environment variables.


To transport the entire environment from the agent to the resource can be very expensive. It is
like moving an entire container or an entire virtual machine from an agent to a resource; it is
like a full operating system move. So, this is not a very good idea. Instead, the resource
creates a sandbox, and the job runs within the sandbox.


Whenever the job makes a system call, let us say it opens a file, the sandbox intercepts the
system call and sends it to the shadow. Let us say that the job wants to read the 10th byte of
a given file, foo.txt. Then this information is sent to the shadow. The shadow opens foo.txt,
reads the 10th byte, and sends the 10th byte back to the sandbox. So, the sandbox creates a kind
of proxy environment for the job.
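

The foo.txt example can be mimicked with two small classes. This is a toy model under strong
assumptions: the "network" is a direct method call, the home file system is an in-memory
dictionary, and the classes `Shadow` and `Sandbox` and their methods are invented for this
sketch rather than taken from Condor.

```python
class Shadow:
    """Runs on the agent's side and owns the job's home files."""

    def __init__(self, home_files):
        self.home_files = home_files   # file name -> file contents (bytes)

    def handle(self, request):
        """Execute a forwarded call on the home file system and return the data."""
        op, name, offset, length = request
        if op != "read":
            raise ValueError(f"unsupported operation: {op}")
        return self.home_files[name][offset:offset + length]

class Sandbox:
    """Runs on the resource; forwards the job's file reads to the shadow."""

    def __init__(self, shadow):
        self.shadow = shadow

    def read(self, name, offset, length):
        # The job believes it is reading a local file; the request is
        # actually shipped to the shadow, which does the real I/O.
        return self.shadow.handle(("read", name, offset, length))

shadow = Shadow({"foo.txt": b"abcdefghijklmnop"})
sandbox = Sandbox(shadow)

# The 10th byte of the file lives at offset 9.
tenth_byte = sandbox.read("foo.txt", 9, 1)
```

The job only ever talks to the sandbox, and the host's own files are never touched, which is
the proxy behaviour described above.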


The sandbox needs to ensure that the job cannot harm the host. This is automatically ensured,
because all the system calls of the job are intercepted and sent to the shadow. It also needs to
ensure that the host cannot harm the job. This is also ensured, in the sense that the main way
that the host can harm the job is by giving it wrong values for system calls. But the results
come from the shadow, the sandbox has checks of its own, and additionally, modern sandboxes
also run on secure trusted hardware. So, it is not really possible for the host to harm the job,
and if the host tries something like this, it will get detected.


So, the long and short of this is that the host will not be able to tamper with the job. Then,
in most cases, we need to marshal and unmarshal I/O data. When we send data from the sandbox
to the shadow, the process of making it machine independent is marshalling. Reading the data
and deciphering it at the shadow is unmarshalling.
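

As an illustration of marshalling, the sketch below encodes a read request into a fixed,
machine-independent byte layout using the standard `struct` module and decodes it on the other
side. The wire format (a 2-byte name length, the UTF-8 name, an 8-byte offset, a 4-byte length,
all in network byte order) is an assumption made for this example and is not Condor's format.

```python
import struct

def marshal_read_request(name, offset, length):
    """Encode (name, offset, length) in network byte order."""
    encoded = name.encode("utf-8")
    # "!" selects network (big-endian) byte order with no padding.
    return (
        struct.pack("!H", len(encoded))
        + encoded
        + struct.pack("!QI", offset, length)
    )

def unmarshal_read_request(payload):
    """Decode the bytes produced by marshal_read_request."""
    (name_len,) = struct.unpack_from("!H", payload, 0)
    name = payload[2:2 + name_len].decode("utf-8")
    offset, length = struct.unpack_from("!QI", payload, 2 + name_len)
    return name, offset, length

wire = marshal_read_request("foo.txt", 9, 1)
decoded = unmarshal_read_request(wire)
```

Because the byte order and field sizes are fixed, the sandbox and the shadow can run on
machines with different architectures and still agree on what was sent.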


So, in this case, the data is sent for every system call; the shadow executes the system call
and sends the result back. Of course, in modern versions of Condor, we can split the set of
system calls. Instead of forwarding every system call, we can say that the shadow will run all
the system calls for files and maybe database connections.


The sandbox can run other system calls that do not really require the shadow, for example,
getting the time. We are okay if the job gets the time on the sandbox, or maybe sends or
receives a message on the network; subject to some limitations, these can run on the sandbox.


As I said, modern versions of Condor allow you to write very flexible scripts, and also to
change the source code of Condor, to ensure that a certain subset of system calls gets handled
by the shadow and the remaining subset gets handled by the sandbox.
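

The split can be pictured as a simple routing table. The set of call names below is an
assumption chosen to mirror the examples in this lecture (file calls go to the shadow, getting
the time stays local); it is not a list taken from Condor.

```python
# Calls that touch the job's home environment are forwarded to the shadow.
REMOTE_CALLS = {"open", "read", "write", "seek"}

def route(call_name):
    """Decide where a system call is executed: on the shadow or on the sandbox."""
    return "shadow" if call_name in REMOTE_CALLS else "sandbox"

routes = {name: route(name) for name in ["open", "read", "gettime"]}
```

Keeping calls such as getting the time on the sandbox avoids a network round trip for calls
whose result does not depend on the agent's environment.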


So, what is the universe? The universe is a matching sandbox and shadow pair, and a universe
has a certain set of rules. What I just described is the standard universe; you can have other
universes as well, and we will discuss them. Here, the idea is that the entire environment of
the shadow is not sent to the sandbox. Instead, the sandbox traps system calls (we will see how)
and sends them to the shadow; the shadow executes them and sends the results back.


In the standard universe, which is for Unix environments, the typical system calls where data
transfer actually takes place are I/O calls. So, the shadow runs an I/O server: it takes
requests from the running job, satisfies each request using the home file system, and returns
the data. The simplest possible way to do this is that, at compile time, user code is linked
with the Condor libraries.


So, let us take a file operation like open, fwrite or seek; read, write and seek are typical
file access system calls. The library calls that invoke these system calls are wrapped. For
example, fscanf in C allows us to read formatted input from a file. When I link a program with
the Condor library, the library provides a wrapper on fscanf.


The wrapper ensures that there is a call to the shadow; the shadow executes the call and returns
the results. So, this is how we are essentially defining a kind of virtual file system, where
the file system is resident on the shadow. For every single sandbox where the job runs, by
marshalling these I/O arguments and getting the data back, we ensure that even if the cloud
node does not share the same file system, our job can still run.


This is very important, because if some remote job is running on my desktop, it should not
access my files; it should access the files from where it came. This mechanism allows us to do
precisely that. Additionally, the standard universe provides support for checkpointing, which
means that periodically the entire state of the process, that is, the registers and the memory
contents, is checkpointed and can be restored.


So, along with the standard universe, we have a Java universe. Why was it necessary to create
a special universe for Java code? Well, because Java code typically reads a lot of other files,
like Java libraries. It would not have been a good idea to read the contents of all of these
files over the network using the shadow and sandbox mechanism. Instead, the Java universe
places all the necessary class and archive files in the classpath of the job on the remote
machine as well.


This ensures that if any Java program running on the resource needs a library file for standard
Java functions, it need not come to the agent; it will find it locally at the resource itself.
This is clearly the biggest advantage: all of Java's baseline files are available at the
resource, so the job does not have to come back to the agent. That is point number one.


The other point is that, the same way we do in C, we can link the job with a custom Java
library. This Java I/O interface is a wrapper on Java's input and output streams. It does the
same thing for regular files: it sends the request to the agent, and the agent sends the result
back. The agent runs the shadow, and the resource runs the sandbox. But this can be a smart
interface, which can also do authentication, pass through firewalls and so on.


Now, Condor has some support for data-intensive computing, where we need a lot of data, for
example in biological and scientific applications. A new resource manager called NeST was
created that has file transfer agents. A new file transfer agent called Stork was added to
Condor that can synchronize large file transfers. We can also use a variety of network
protocols like HTTP and FTP.


To smooth out very large data transfers, Condor adds a series of disk-based routers that
optimize the communication with the hard disk. Also, Condor has a new module called Parrot for
communicating with unusual storage devices. I would rather use the word unconventional than
unusual: something like flash was unconventional in those days. That is why the word unusual
was used, even though it should be unconventional.


Finally, a word about security. Security is clearly a big issue in a Condor-like distributed
system, because jobs run on untrusted machines and machines run untrusted jobs. So, Condor has
a secure communication library called Cedar. Cedar is a wrapper on secure technologies like
SSL, SASL, Kerberos and other protocols. It ensures that any network traffic passes in an
encrypted fashion.


Furthermore, for secure execution, the question is: if a remote job is running on my machine,
what permissions does it have? One option is that, at the resource, users are given a very
restricted login, and the chroot (change root) feature is not available, which means that they
are stuck in a part of the file system and there are a very limited number of things that this
login can do.


That is one option. The other is that all Unix-based systems, including Linux, support the
nobody account. If you look at /etc/passwd, you will find the nobody account. The nobody
account is one such extremely restrictive account that just allows us to run a compute job and
access very little of the file system. This is the default: unless we have other mechanisms,
Condor uses the nobody account, which gives the job very limited and restricted permissions on
the resource.


Condor can also dynamically create a user ID and assign it to a job. Or it is possible to
create a domain of users within a group who trust each other. Let us say that my friend wants
to run a job on my machine, and I trust my friend. Then I can create a login for my friend on
my machine, or my friend can use my login, or we can have a standard network login. All of
those combinations are possible.


So, it depends on the level of trust. But as I said, nobody is the default, and creating a
single domain per user, like a distributed group, is also possible. Finally, Condor has a
cleanup feature that kills all the processes so that you can start from the beginning.


Condor has been around for quite some time, but the first nicely written description of Condor,
the paper that this presentation is based on, was The Condor Experience, published in February
2005. What I would like to encourage the viewers of this video to do is to download a Condor
system, which is freely available, and install it on their machines; the larger the cluster,
the better it is.


They will find that it is very easy to submit, monitor and execute a job. If this is the Condor
cloud, a machine can submit jobs to it. This is particularly useful for those who run
simulations: let us say that they need to run 1000 simulations; these can all be sent to the
Condor cloud, and Condor will schedule and run the jobs.


After that, several post-processing scripts can be run, which means that the outputs of the
jobs can be taken, graphs can be plotted, and the results can be sent back to the machine. Here
also we can run a post-processing script, and it is possible to send an email to the job
submitter to say that all the jobs are done and the results are ready to view. This can increase
productivity quite a bit, in the sense that we do not have to continuously poll the system to
find out how far the jobs have gone.


So, we can create a large cluster and install Condor on it, and Condor will automatically
ensure that all the jobs execute correctly, and once they are done, the user will get to know.
This was pre-2005 technology. In the next lecture, we will discuss something far more recent:
the Microsoft DryadLINQ system, which extends this and makes it more sophisticated.


In this lecture, we will discuss the DryadLINQ system, which is Microsoft's engine for
distributed computation. The prerequisite for this lecture is the lecture on Condor, the Condor
distributed batch processing system, which is there in the same playlist.


We will first discuss the basic idea and the related work. Then we will go through the design
and the architecture of the DryadLINQ system, and then look at two specific benchmarks:
Terasort, which sorts a terabyte of data, and the SkyServer benchmark, which is a benchmark for
processing astronomical data.


The basic idea is that many of the current programming models for large-scale distributed
programming, like Map-Reduce, MPI and Microsoft Dryad, are essentially of the same type. They
put a lot of load on the user. They are based on explicit parallelism, where a large part of the
planning and the scheduling is left to the programmer.


I am not discussing planning and scheduling once again, because this was discussed in the
Condor lecture, where we basically said that planning is the mapping of jobs to pools of
machines, and scheduling is running a job within a pool of machines, the time aspect of it.


So, a quick overview of Map-Reduce: it essentially takes a large problem, divides it into a set
of smaller problems, distributes the small problems to different machines, and then collects
and collates the results. MPI is at an even lower level, where machines just send messages to
each other, and based on these messages, the final output of a program is computed. Microsoft
Dryad is Microsoft's distributed execution system, where again a program is specified as a
directed acyclic graph of jobs or tasks, Task 1, Task 2, Task 3 and so on, and these are mapped
to different machines.


But here again, the onus is on the programmer to divide a program into tasks and specify the
dependencies between them. The dependencies are specified as a directed acyclic graph, and
these tasks run on the system. These systems are kind of extensions of Condor. DryadLINQ
extends such a system.


It takes existing LINQ expressions, Language Integrated Query expressions. These are
programming language expressions in the .NET framework, where each expression is a
side-effect-free transformation of the data. Side-effect free means that we take some data and
apply a set of functions to it.


The data keeps on changing, but the computation does not access any external data. For example,
it does not change a file, send a message over a network, or access any other piece of memory
that is not contained in the data. This is a side-effect-free transformation, which is heavily
used in functional programming.
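

To make the idea concrete, here is a side-effect-free transformation written in Python rather
than in LINQ; it is an analogy, not DryadLINQ code, and the function name `transform` and the
data are made up. The function reads only its argument, builds a new value, and touches no
files, network or shared memory.

```python
def transform(data):
    """Keep values greater than 2, double them, and return them in sorted order.

    The input is not modified; a new list is returned.
    """
    return sorted(x * 2 for x in data if x > 2)

records = [5, 3, 8, 1]
result = transform(records)
```

Because the function depends only on its input, it can be re-run after a failure, or run on
different pieces of the data on different machines, without changing the answer.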


LINQ expressions use that idea, and once a program is written using these LINQ expressions, the
Dryad system parallelizes portions of the program and runs them on thousands of machines. When
you combine Microsoft's existing distributed execution engine, Dryad, with LINQ, which is a
functional programming paradigm, you get DryadLINQ. This is a state-of-the-art system. Using
it, it is possible to write very simple code to sort a terabyte-scale data set in just 319
seconds, about 5 minutes, on a 240-node system, which is not a large system.


So, for 240 nodes this is actually a very good result. There are certain prerequisites here.
Before we proceed, viewers should have some idea of functional programming, particularly of
programming that does not use the notion of permanent state. It does not use files and networks,
or even arrays, and it does not have pointers.


You very simply have data, and you apply a set of functions to it. The notion of functional
programming should be clear; otherwise, viewers will not be able to appreciate this lecture. By
functional programming, we are talking of languages like OCaml, ML, Scheme and so on. Of course,
viewers should also have some idea of a distributed batch processing system, where we have a
large number of tasks and the dependencies between the tasks are specified as directed acyclic
graphs. The lecture on Condor, which is there in the same playlist, would set viewers up for
what is coming in this lecture.


So, LINQ, as I mentioned, is Language Integrated Query, which is a set of imperative and
declarative constructs in almost all of the .NET languages: C#, F# and VB (Visual Basic). It is
not purely functional, because a pure functional language will not have many of the imperative
directives. So, it is semi-functional and semi-imperative. The imperative directives are, of
course, variables, loops, iterators and conditionals.


For loops and if statements are there. We also have a very strong notion of types and of type
inference; these are strongly typed languages. And we have functors. I will not describe the
background of functors, but I would request viewers to look up the Wikipedia articles on
functors, and in general on treating functions as first-class objects. The notion of treating
functions as separate objects in their own right should be clear to viewers before they
proceed. A quick scan of these concepts on Wikipedia would set viewers up for the rest of the
28 slides in this talk.
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So, a brief review of related work. So, of course, we always have SQL as the gold standard to 
compare with, say in SQL, also, we do not specify how something needs to be done. Rather, we 
specify what needs to be done. So, let us say we want to read a table, and then we want to order 


them we do not actually work at the level of sorting the data. 


So, we do not say use quick sort or merge sort, for example. So, this is a declarative language, 
where you pretty much tell SQL what needs to be done. And of course, SQL has automatic 
parallelization mechanisms where databases automatically parallelized SQL Queries. Or also we 


have variants of SQL known as parallel variants of SQL. 


After SQL, we had a large explosion of big data programming languages, and the most prominent
among them was Map-Reduce. Here, of course, we take a large problem and break it into small
chunks. In the map phase we execute each small chunk, and then we collate the results in the
reduce phase.
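To make the two phases concrete, here is a minimal single-process sketch in Python. It is only an illustration of the idea; the word-count task and the names `map_phase` and `reduce_phase` are assumptions chosen for the example, not part of any real Map-Reduce API.

```python
from collections import defaultdict

def map_phase(chunk):
    # Process one small chunk: emit (word, 1) for every word in it.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # Collate the intermediate pairs by key and sum the counts.
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# The "large problem", already broken into small chunks (hypothetical data).
chunks = ["a b a", "b c", "a c c"]
intermediate = [pair for chunk in chunks for pair in map_phase(chunk)]
word_counts = reduce_phase(intermediate)
```

In a real deployment each call to `map_phase` would run on a different machine; here they run one after another in a single process.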


Of course, a lot of operations like sorting or database joins were hard. Two of the problems
that we describe here from the paper are Terasort and SkyServer, the latter of which essentially
performs database joins. Both of them are hard to do in a Map-Reduce kind of setting. Also, the
support for types in Map-Reduce is weak, the reason being that it was never the mainstay of the
design. So, if there is a problem because of the wrong usage of types, for example, something is
an integer on one machine and it is read as a floating point number on some other machine, there
is actually no way to catch such errors.

Then, of course, we have many domain-specific implementations of Map-Reduce. We have Sawzall,
which is very popular. We have Pig, which was Yahoo's system and is again very popular. And of
course we have Facebook Hive, which is Facebook's rendition of Map-Reduce. All of these are
rather popular. They use the same simplistic notion of a map and reduce kind of operation. They
are also a combination of declarative and iterative constructs, and they are inherently
extensions of SQL. In that sense, DryadLINQ makes a marked departure and tries to bring what
Condor does to this area.


As we had mentioned at the end of the Condor lecture, DryadLINQ is a more modern avatar of
Condor. The first point is that unlike Map-Reduce, which makes rather strong assumptions about
the system, here the computation is not dependent on the nature of the underlying resources. The
underlying resources are reasonably independent of the computation.

The second point is that we make a virtual execution plan. Here also, the planning phase and the
scheduling phase are separate. We make a virtual execution plan, which describes the set of jobs
that we will execute, and then the plan is scheduled on a cluster of machines. On the cluster, a
lot of dynamic changes can happen: we can have faults, we can have outages, a machine can swap
jobs out. Handling all of that is the scheduling aspect. But planning and scheduling are
separate, so what the scheduler gets is a virtual execution plan.


Structure of a Dryad job: similar to Condor, it is a directed acyclic graph of jobs T1, T2, T3,
and then of course we can have a job T4 that is dependent on the completion of both T2 and T3.
In this case, each of the vertices is a program, and each edge is a data channel that transmits
a finite sequence of records at runtime. So, T4 would get some data from T2 and some data from
T3; once it gets the data of both T2 and T3, it starts executing.
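The job structure just described can be sketched in a few lines of Python. This is a toy, single-machine illustration and not Dryad's scheduler; the task bodies are invented, and the assumption that T2 and T3 both read from T1 is made only to complete the picture.

```python
from graphlib import TopologicalSorter

# Each vertex is a program (here a plain function); each edge is a data channel.
def t1():
    return [1, 2, 3]

def t2(xs):
    return [x * 2 for x in xs]

def t3(xs):
    return [x + 10 for x in xs]

def t4(a, b):
    return a + b

programs = {"T1": t1, "T2": t2, "T3": t3, "T4": t4}
# Maps each task to the ordered list of tasks whose output it consumes.
depends_on = {"T1": [], "T2": ["T1"], "T3": ["T1"], "T4": ["T2", "T3"]}

def run_dag(programs, depends_on):
    outputs = {}
    # static_order() yields every task after all of its predecessors.
    for task in TopologicalSorter(depends_on).static_order():
        inputs = [outputs[d] for d in depends_on[task]]
        outputs[task] = programs[task](*inputs)
    return outputs

results = run_dag(programs, depends_on)
```

T4 runs only after the outputs of both T2 and T3 are available, which is exactly the dependency rule stated above.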
Now we will discuss the architecture of the system. The Dryad system contains a centralized job
manager, similar to the manager-worker paradigm in Condor. This instantiates a job's data flow
graph. Similar to the tracking manager and steering manager in Condor, it also schedules the
processes that are supposed to run, ensures fault tolerance (if there is a crash, it restarts
the job), monitors the job, and transforms the job graph at runtime according to the user's
instructions. This last part is new.

The job graph can be transformed dynamically, and we will see how, according to certain
directives that the user gives. This part is new and rather interesting. So, what is the
execution overview? The overview is like this: the user runs a .Net application, and the .Net
application creates a DryadLINQ expression object.


In functional programming, we can create an expression object. It is also possible to partially
evaluate it. Take an expression such as f(g(x)): the function g is evaluated on one machine,
and what is sent to the other machine is f along with whatever g produced, and this can then be
evaluated at that other machine.

Partial evaluation is also tantamount to deferred evaluation, where, for a sequence of actions,
each action is represented by a function call. Some of it is evaluated immediately, and some of
it is evaluated later. Let us say that in this case we evaluate g immediately and evaluate f
later. This is called deferred evaluation.
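As a rough illustration of evaluating g now and f later, consider the following Python sketch. The bodies of `f` and `g` are arbitrary stand-ins, and the two "machines" are simulated by two separate steps in one process.

```python
from functools import partial

def g(x):
    return x * x

def f(y):
    return y + 1

def evaluate_g_now(x):
    # "Machine A": evaluate g immediately, then return f bound to g's output.
    # The returned object is a callable that has not been run yet.
    return partial(f, g(x))

deferred = evaluate_g_now(4)   # g has run; f has not
result = deferred()            # "Machine B": finish the evaluation later
```

The object `deferred` plays the role of what would be sent across the network: the function f together with the value that g produced.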


Of course, people who have studied functional programming would find this very easy to relate
to. For people who have not studied functional programming, I would definitely ask them to go
through a tutorial, on Wikipedia or anywhere else, that talks about partial evaluation. We
evaluate something partially now and then defer the rest to a later point in time.

The application calls the method ToDryadTable. This method hands over the expression object to
DryadLINQ. The expression is essentially an expression of the data and what needs to be done
with the data. This entire thing is the expression object.

The DryadLINQ system is given the expression object. DryadLINQ compiles the expression and makes
an execution plan. In making the execution plan, it first decomposes the expression into smaller
sub-expressions, and then it generates the code and data for the different Dryad nodes. Since we
are talking of parallelization here, the same work will be divided and distributed across the
nodes.


Each of the nodes may have separate code and data, and whatever is needed is generated. Then,
similar to Condor, when data is being sent from a manager to a worker (using the same
terminology), it is very well possible that one machine uses a big-endian representation and
the other uses a little-endian representation.

To ensure that data is transferred correctly, we need to make the transfer independent of the
exact representation. This process is known as serialization and deserialization, or as
marshalling and unmarshalling. And once the data reaches there, all the workers need to work.
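The byte-order issue can be seen with Python's standard `struct` module. This is only a sketch of the general idea of marshalling to a fixed wire format; the record layout (an integer id and a floating point score) is a made-up example and has nothing to do with DryadLINQ's actual generated serialization code.

```python
import struct

def serialize(record_id, score):
    # "!" selects network byte order (big-endian) with standard sizes:
    # "i" is a 4-byte signed integer and "d" is an 8-byte double.
    return struct.pack("!id", record_id, score)

def deserialize(payload):
    # The receiver decodes with the same agreed format, whatever its native order.
    return struct.unpack("!id", payload)

wire = serialize(7, 2.5)
record = deserialize(wire)

# The same integer has different byte layouts in the two representations.
little = struct.pack("<i", 7)
big = struct.pack(">i", 7)
```

Because both sides agree on the wire format, the receiver recovers the same values regardless of which representation its own hardware uses.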


And maybe, after that, they send the data to one more worker. This is like a synchronization
point, because that worker waits for all the other intermediate nodes to finish their work. All
of this code is automatically generated; none of it has to be written by the user. All that the
user actually gives DryadLINQ is an expression.

Everything else happens automatically: decomposition into smaller sub-expressions, generation of
code and data for the individual Dryad nodes, generation of serialization and deserialization
(also known as marshalling and unmarshalling) code, and generation of the code that synchronizes
data and code accesses across the nodes.


Dryad invokes a custom job manager to do all of this. The job manager creates a job graph, which
is used to schedule and spawn the jobs. Each node in this graph executes the program that is
assigned to it. And of course, when the program is done, it writes the data to an output table.

Almost all the data in Dryad is stored in a tabular format; recall Percolator and Google
Bigtable. The inputs go in via a table, and the outputs come back via a table. After the job
manager terminates, which means all the individual tasks have terminated, DryadLINQ collates all
the outputs and creates the final DryadTable object, which essentially contains all the results.
Control then returns to the user application. This can be a very large table, so it may not be
stored in one location. It is actually a distributed table, which is stored across multiple
locations.


So, what Dryad does is that it passes an iterator object. The iterator object is essentially a
handle with which to traverse this table. As the iterator traverses the table, the data is
fetched from wherever it is in the network. The iterator object can also be passed to
subsequent statements.

It is like a handle to the result. The results themselves can be quite large, and they can be
distributed, but that does not matter as far as the invoking application is concerned. To get
any row of the table, it just has to use the iterator object, and Dryad will automatically fetch
the row from wherever it is stored.
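The behaviour of such an iterator can be imitated with a Python generator. Everything here is hypothetical: `REMOTE_PARTITIONS` stands in for rows stored on other machines, and `fetch_partition` stands in for a network call.

```python
# Hypothetical stand-in for remote storage: node name -> rows held on that node.
REMOTE_PARTITIONS = {
    "node-a": [("doc1", 0.9), ("doc2", 0.4)],
    "node-b": [("doc3", 0.7)],
}
fetch_log = []  # records which nodes were actually contacted

def fetch_partition(node):
    # In a real system this would be a network request.
    fetch_log.append(node)
    return REMOTE_PARTITIONS[node]

def table_iterator(nodes):
    # Rows are fetched only when the caller actually reaches them.
    for node in nodes:
        for row in fetch_partition(node):
            yield row

rows = table_iterator(["node-a", "node-b"])
first = next(rows)
```

After reading the first row, only `node-a` has been contacted. The caller never needs to know where the remaining rows live; continuing to iterate triggers the next fetch.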
Now a word or two about LINQ. LINQ has a very deep object-oriented hierarchy. The base type in
LINQ is an interface (recall that an interface primarily defines functions) of the type
IEnumerable<T>. IEnumerable<T> is basically an iterator with which we can go through every entry
of the table; that is the base type.

It is an iterator for a set of objects with type T. The programmer is not aware of the data type
that is associated with an instance of IEnumerable<T>, and does not have to know. IQueryable<T>
is a subtype of IEnumerable<T>. It is an unevaluated expression, one of those functional
expressions that we have spoken about earlier. Let us say that we want to add 2 + 3, or say
2 + 3 + 5. This can be represented as f(f(2, 3), 5), where f is the add function. It is
possible to evaluate the inner f on one machine and then transfer the result to another machine
for deferred evaluation of the outer f. This is, of course, a very simple example.


But in a real-life setting, such functional programs do undergo deferred evaluation, because
many times an input that is required, like the 5 over here, might not be available yet, or the
resources required to compute this f might not be available; we might need a specialized
resource to compute it.
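The f(f(2, 3), 5) example can be written as a small expression tree that is evaluated in two phases. This is a sketch under stated assumptions: the classes `Const`, `Var` and `Add` and the function `partial_eval` are invented for illustration, and the missing input is modelled as a variable named z whose value (5) arrives later.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Const:
    value: int

@dataclass(frozen=True)
class Var:
    name: str

@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"

Expr = Union[Const, Var, Add]

def partial_eval(expr, env):
    # Fold whatever can be computed from env; leave the rest as an expression.
    if isinstance(expr, Var):
        return Const(env[expr.name]) if expr.name in env else expr
    if isinstance(expr, Add):
        left = partial_eval(expr.left, env)
        right = partial_eval(expr.right, env)
        if isinstance(left, Const) and isinstance(right, Const):
            return Const(left.value + right.value)
        return Add(left, right)
    return expr

# f(f(2, 3), z): the inner add is ready now, z is not yet available.
expr = Add(Add(Const(2), Const(3)), Var("z"))
shipped = partial_eval(expr, {})           # first machine: inner f is folded
final = partial_eval(shipped, {"z": 5})    # second machine: outer f completes
```

The intermediate value `shipped` is itself an expression, which is what makes it possible to send it elsewhere and finish the evaluation once the missing input is known.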


That is why the entire function is treated as an expression, which undergoes different phases of
evaluation at different points in time. To implement the IQueryable<T> expression at runtime,
the system of course instantiates it and creates a concrete class. What are the prerequisites
to understand what I just said? Well, the ideas of object-oriented programming have to be very
clear, particularly in the context of .Net.

It should be clear what a class is, what an interface is, and how you instantiate a class.
These are some of the basic fundamentals that I am assuming before I go forward. Let me give an
example of the LINQ SQL syntax. Let us say I were to join two tables, scoreTriples and
staticRank.


I say that to get the adjusted score triples, I take each entry d in scoreTriples and join it
with r in staticRank, as long as the document ID of the entry d in scoreTriples equals the key of
the entry r in staticRank. Then I select a new QueryScoreDocIDTriple, which contains d and r.
So the adjusted score triples are essentially combinations of d and r for which
d.docID = r.key.

This is a regular database join, and there is nothing special about it, but this is exactly the
way that we specify it in LINQ. Then we can create the ranked queries, which means I take each s
in the adjusted score triples and group the s entries by the value of the query field, s.query,
into g. For each such group, I take the TopQueryResults. As you can see, this is not very
different from regular SQL syntax. This is a LINQ SQL syntax example, which is an extension of
LINQ to model SQL.
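For viewers who do not read C#, the same join, group and select steps can be mimicked in plain Python. This is an analogue, not LINQ: the sample rows, the `score` and `rank` fields, the rule for combining d and r (multiplying score by rank), and the helper `top_query_results` are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sample rows; docID, key and query follow the names in the lecture.
score_triples = [
    {"query": "q1", "docID": 1, "score": 0.5},
    {"query": "q1", "docID": 2, "score": 0.9},
    {"query": "q2", "docID": 1, "score": 0.3},
]
static_rank = [{"key": 1, "rank": 2.0}, {"key": 2, "rank": 1.0}]

# Join: pair each d with the r for which d.docID equals r.key.
rank_by_key = {r["key"]: r for r in static_rank}
adjusted_score_triples = [
    {
        "query": d["query"],
        "docID": d["docID"],
        "score": d["score"] * rank_by_key[d["docID"]]["rank"],
    }
    for d in score_triples
    if d["docID"] in rank_by_key
]

# Group: collect the entries s by the value of s.query.
groups = defaultdict(list)
for s in adjusted_score_triples:
    groups[s["query"]].append(s)

def top_query_results(group, k=1):
    # Keep the k highest-scoring entries of one group.
    return sorted(group, key=lambda s: s["score"], reverse=True)[:k]

ranked_queries = {q: top_query_results(g) for q, g in groups.items()}
```

The Python version states how to do each step, whereas the LINQ query only states what is wanted and leaves the how to the system.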
The moment such a query is given to the Dryad system, it internally breaks it into expressions
and sub-expressions. As we just discussed, there is a complicated compiler chain that
instantiates code and data and creates packages of code and data that are sent to the
individual machines that run parts of this query.

Again, the same thing can be done in an object-oriented style. The idea is that instead of
using an SQL-like syntax, we can use a syntax like that of object-oriented programming. In this
case we do the same thing: we call the join function with staticRank, and the join is on d and
r, where d supplies the doc ID and r supplies the key, and both of these need to be equal. We
then create a QueryScoreDocIDTriple. And we do the same thing as before: we group the queries
based on the query field, and finally, for each group, we select the top query results. This
query is exactly the same as the previous one. It is just that LINQ provides different ways of
expressing the same thing, to ensure that it is compatible with the existing query languages.


Any DryadLINQ collection, an IEnumerable<T>, is essentially a distributed data set. The input is
always a table, similar to SQL. It is a large table, similar to Google Bigtable as well. You can
partition it in several ways. After all, the large data set needs to be partitioned among the
nodes such that each node can work on a part of it.

One option is to hash each of the elements, store them in the sorted order of their hashes, and
partition them on the basis of the hashes. This is hash partitioning. Or we can take any single
column of the row and, based on the value of the column, create ranges and partition by range.
This is range partitioning. Or we can simply distribute the rows in a round-robin fashion. This
is round-robin partitioning.
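The three partitioning schemes can be sketched as follows. These are simplified illustrations with invented function names; in particular, the hash partitioner here assigns a row to partition hash mod n, which is one common variant and is an assumption rather than the exact scheme used by DryadLINQ.

```python
import bisect
import zlib

def hash_partition(rows, key, n):
    parts = [[] for _ in range(n)]
    for row in rows:
        # crc32 is used instead of hash() so the result is stable across runs.
        parts[zlib.crc32(str(key(row)).encode()) % n].append(row)
    return parts

def range_partition(rows, key, boundaries):
    # boundaries must be sorted; partition i holds keys <= boundaries[i],
    # and the last partition holds everything above the final boundary.
    parts = [[] for _ in range(len(boundaries) + 1)]
    for row in rows:
        parts[bisect.bisect_left(boundaries, key(row))].append(row)
    return parts

def round_robin_partition(rows, n):
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts

rows = [5, 12, 7, 30, 18, 1]
by_range = range_partition(rows, lambda r: r, [10, 20])
by_turn = round_robin_partition(rows, 2)
by_hash = hash_partition(rows, lambda r: r, 3)
```

Range partitioning keeps nearby keys together, which is what a distributed sort needs, while round-robin gives the most even sizes without looking at the data at all.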


The results of any single DryadLINQ computation are represented by the object DryadTable<T>,
where T is the type of the data. The subtypes determine the actual kind of storage: if it is an
integer, we store it in a certain way, and if it is a floating point number, we store it in
another way. We can also include additional details to specify the schema, the exact storage
structure and the metadata of these DryadTables. So, what is the broad idea now?

Well, the broad idea is that the DryadLINQ system takes in LINQ expressions and data. The data
is stored as tables, which are instantiations of the IEnumerable<T> interface. Dryad internally
partitions the tables based on any one of these algorithms: hash, range or round-robin
partitioning. The individual partitioned tables are then sent to different machines, where LINQ
expressions or sub-expressions run on them.

Finally, we get the result, which is a table. Of course, the table is very large, so it is a
distributed table where parts of it are stored on different machines. But the client that
issued the request does not care, because it gets an iterator in return, and the moment it
accesses something, Dryad knows where to locate the data.


All of the methods, as we have discussed in the past even with Condor, have to be side-effect
free. Being side-effect free means that only the part of the data that we are changing actually
changes. Another way of saying side-effect free is that there are no global variables, no files,
nothing outside the scope.

Shared objects can be distributed in any way. The functions to access a DryadTable are
serializable, which in this case means that serializable semantics are followed. When I say
GetTable, this means that I get the entire table, so this is like a transaction. And
ToDryadTable, as we saw, is the gateway function for starting a Dryad job.

We can also define our own partitioning functions, such that the tables can be partitioned as
we wish. We can define a custom hash partition function or a custom range partition function. In
a certain sense this is conceptually like Map-Reduce, except that Map-Reduce partitions the
computation and Dryad partitions the data.


In this case, if you look at it, a large table is partitioned into smaller tables, and
regardless of how that is done, each of these smaller tables is dispatched to a different node.
On the different nodes, LINQ expressions work on the tables to produce an output. They can be the
same LINQ expression or different LINQ expressions.

In addition, we have two specialized functions that run on them. One is the apply function,
which applies a given function f to all the elements in the table. We take each row of the
table, or each cell of the table within each column of each row, and we apply the function to
all of them. For example, the function can be multiply by 2, so we take the table and multiply
every single entry by 2. The other is the fork function. It is similar to apply, but the output
can be not just one data set but multiple data sets; for example, the fork function can take a
table and output two tables. Dryad is thus quite expressive: it supports different
parallelization and storage policies, and they can be provided as directives.
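The difference between the two functions, as described here, can be sketched in Python. The functions `apply` and `fork` below are illustrative stand-ins written to match the description in this lecture; they are not the DryadLINQ signatures.

```python
def apply(table, f):
    # One output data set: f applied to every row.
    return [f(row) for row in table]

def fork(table, f):
    # f returns a tuple per row; each tuple position becomes its own output data set.
    outputs = None
    for row in table:
        results = f(row)
        if outputs is None:
            outputs = [[] for _ in results]
        for out, value in zip(outputs, results):
            out.append(value)
    return outputs or []

table = [1, 2, 3]
doubled = apply(table, lambda x: x * 2)
halves, squares = fork(table, lambda x: (x / 2, x * x))
```

Here `apply` yields one table with every entry multiplied by 2, while `fork` reads the input once and produces two tables.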
Evaluation

Results: Terasort

- The number of nodes was varied from 1 to 240.
- Each node stored 3.87 GB of data.
- The execution time was 120 s for 1 machine, and quickly jumped to 250 s.
- Then it grew very slowly (sub-linearly) to 320 s.

Source [1]
Let us now discuss the execution plan graph, or the EPG. The DryadLINQ system converts LINQ
expressions to the nodes of an EPG. The EPG is a DAG: we essentially create tasks and sub-tasks,
and there are relationships between the tasks in the same way as in any DAG.

Furthermore, it is possible for a part of the EPG to be generated at runtime based on the values
of iterative or conditional expressions. Mind you, this was not possible in Condor. Here, let us
say we have an if statement: if the condition holds, we add an extra task; otherwise, we do not.

So, it is possible for the EPG to change dynamically based on the data. This is a rather
powerful feature of DryadLINQ. Furthermore, similar to the ClassAd system in Condor, we specify
the requirements of a job in its metadata. This specifies the requirements on the physical
nodes, the machines that actually run the job, as well as the parallelization directives, which
are the directives that we use to parallelize the tasks while generating the EPG. Second, and
this is new in a functional paradigm, we need to support the deferred evaluation of functions.
This is something that we have to do to support the LINQ semantics.


There are several kinds of static optimizations that we can do. These are not dynamic
optimizations; they are essentially compile-time optimizations. The first is pipelining, which
means that a single process can execute multiple operations, say the operations of one query.

It can run them in a pipelined fashion. Within a node, I can have multiple threads associated
with a process, where one thread produces some output, a second thread consumes that and
produces an output, a third thread consumes that, and so on and so forth.
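A thread-per-stage pipeline of this kind can be sketched with Python's standard `queue` and `threading` modules. This is only an illustration of intra-node pipelining; the stage functions and the helper names `stage` and `run_pipeline` are assumptions for the example.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream

def stage(fn, inbox, outbox):
    # Consume items from inbox, transform them, and pass them downstream.
    while True:
        item = inbox.get()
        if item is DONE:
            outbox.put(DONE)
            return
        outbox.put(fn(item))

def run_pipeline(items, functions):
    # One queue between each pair of neighbouring stages.
    queues = [queue.Queue() for _ in range(len(functions) + 1)]
    threads = [
        threading.Thread(target=stage, args=(fn, queues[i], queues[i + 1]))
        for i, fn in enumerate(functions)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(DONE)
    results = []
    while (out := queues[-1].get()) is not DONE:
        results.append(out)
    for t in threads:
        t.join()
    return results

# Three stages: parse, square, format. Each runs in its own thread.
piped = run_pipeline(["1", "2", "3"], [int, lambda n: n * n, str])
```

While the third thread is formatting the first item, the second thread can already be squaring the next one, so no stage has to wait for the whole input to be processed.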


So, it is possible to do pipelining within a node. Next, we can remove redundancy, which means
removing dead code, removing LINQ expressions that will not be evaluated, and removing
unnecessary partitioning.

Let us say that prior to compiling, we realize that the user has specified too many partitions.
In practice, maybe the maximum number of partitions that the system can tolerate is 100, and the
user is asking for 1000 partitions. We can override this, because the user can only give hints,
not orders, to the system, and create only 100 partitions based on the system we are looking at.

So, the two optimizations we have seen till now are pipelining, which is intra-node, and
redundancy removal. The third is eager aggregation. This intelligently reduces data movement by
optimizing the aggregation and repartitioning steps.


This essentially means the following. Whenever we take any data, we break it down, that is, we
partition it into several smaller pieces of data, which are mapped to machines and later
aggregated again. But if there are too many partitions, the process of aggregation will take a
lot of time. Also, these data items can be mapped to different machines, and there will be a lot
of communication as well. We can instead do smart aggregation. Maybe a single machine can
process two of the partitions, another machine can process two others, and maybe one of the
machines can do the aggregation as well.
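The benefit of aggregating early can be seen in a small Python sketch. The placement of partitions on two machines and the counting task are hypothetical; the point is only that each machine sends one small summary instead of all of its raw data.

```python
from collections import Counter

def local_aggregate(partition):
    # Runs on the machine that already holds the partition.
    return Counter(partition)

def merge(partials):
    # Combine several partial summaries into one.
    total = Counter()
    for part in partials:
        total.update(part)
    return total

# Hypothetical placement: two machines, each holding two partitions.
machine_a = [["x", "y", "x"], ["y", "y"]]
machine_b = [["x"], ["z", "z", "x"]]

# Each machine combines its own partitions first, so only one small
# summary per machine crosses the network.
summary_a = merge(local_aggregate(p) for p in machine_a)
summary_b = merge(local_aggregate(p) for p in machine_b)
final_counts = merge([summary_a, summary_b])
```

Nine raw items are reduced to two summaries before any data moves, and the final merge can run on either of the two machines.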


In this case, we are reducing the data movement, and this further cuts down on the latency.
Finally, we can reduce the amount of IO. To reduce the amount of IO, instead of creating a TCP
connection for every single transfer, we can have a persistent TCP connection. And if there are
different processes on the same physical node, we need not write to files, which is actually the
default across nodes; we can use in-memory, shared memory channels to transfer data between
processes.

The main aim of this fourth point is that, wherever possible, we try to make use of whatever
the underlying system provides, in terms of either TCP pipes (persistent TCP connections) or
in-memory channels, to increase performance as much as possible. For example, files are slow
and shared memory is fast, so if shared memory is available, we will always opt for shared
memory and not go for files.


A few dynamic optimizations. Dynamic optimizations happen when the job runs and we actually see
the data. Consider OrderBy. OrderBy is one of the SQL directives, which sorts the rows based on
the key value of a certain column. What we do is deterministically sample the values.

We sample the values first, create a histogram, and then do a range partitioning. Let us say
that we sample the values and find that the histogram looks something like this. Then we choose
the range partitions such that the area under the curve is the same for each range.

So, we can create an intelligent range partitioning based on our sample. This ensures that each
physical node is given roughly the same number of entries to sort. Once we have partitioned the
ranges, we divide the input among the nodes based on the ranges, and each node fetches the
inputs assigned to it and sorts them.

Clearly, we can also pipeline these two actions of fetching the inputs and sorting them. This is
one example where we need to see the values at runtime. This is exactly how Terasort was
implemented: first we sample to get an idea of the range of values; from the histogram, we
figure out a range partition; we perform a range partition on the table of data; and we assign
a range to each node. The node then fetches data and simultaneously sorts whatever it has
fetched, and this can be pipelined. This is an example of a dynamic optimization that is
performed in the DryadLINQ system.
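The sample, partition and sort steps can be put together in a short single-process Python sketch. The sampling rule (every second value), the number of nodes and the function names are assumptions made for the example; a real system would run each bucket's sort on a different machine.

```python
import bisect

def choose_boundaries(data, num_nodes, sample_step):
    # Deterministic sampling: take every sample_step-th value.
    sample = sorted(data[::sample_step])
    # Pick boundaries at equal-count quantiles of the sample, so that each
    # range covers about the same area under the histogram.
    return [sample[len(sample) * i // num_nodes] for i in range(1, num_nodes)]

def distributed_sort(data, num_nodes=3, sample_step=2):
    boundaries = choose_boundaries(data, num_nodes, sample_step)
    buckets = [[] for _ in range(num_nodes)]
    for value in data:
        # Route each value to the node that owns its range.
        buckets[bisect.bisect_right(boundaries, value)].append(value)
    # Each "node" sorts only its own range; concatenation is globally sorted.
    return [v for bucket in buckets for v in sorted(bucket)]

data = [42, 7, 19, 88, 3, 56, 71, 25, 64, 11, 90, 38]
boundaries = choose_boundaries(data, 3, 2)
sorted_data = distributed_sort(data)
```

Because every value in one bucket is smaller than every value in the next, the nodes never need to exchange data after the partitioning step.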
This is a pictorial representation of what was just described. We have deterministic sampling
first, which provides us the histogram and the ranges. Based on that, we allocate a range to
each node, and the node pipelines the fetch and the sort.

The EPG here is a virtual execution plan, and DryadLINQ has to dynamically generate code for
each EPG node. We have discussed this in the past as well. What DryadLINQ does is generate a
.Net assembly snippet (.Net has its own assembly language syntax) that corresponds to each LINQ
sub-expression. It contains all the code for serializing data, for IO and for ferrying data. All
of that is built in; the user need not be bothered about any of these steps. The code of the EPG
node is generated at the computer of the client, which is the job submitter.

The nice thing about .Net is that it is actually a virtual instruction set, similar to Java
bytecode. We can always generate the .Net code on any machine, and when it physically runs on
the actual machine, there is a translation between the virtual code that .Net has generated and
the real machine instructions.


This is very similar to the way that Java and Java bytecode work. The EPG node code is generated
on the computer of the client, the job submitter, because it may depend on certain things in the
local context. Similar to a Condor shadow, there might be files in the local file system, and
there might be environment variables.

All the values in the local context get embedded in the function expression. This means that,
given the entire LINQ expression, whatever can be brought in from the local context is embedded
in the expression. We then have a partially evaluated expression; let us call it PE.

This partially evaluated expression is now dispatched to each of the individual nodes for
further, deferred evaluation. We use .Net reflection. Reflection is a general concept, and it
came to Java a long time back. In this context, the .Net virtual machine takes a look at each
class and at all the libraries, and finds all the libraries that the code is dependent upon.

Those dependencies will have dependencies of their own. The transitive closure is the set of
all such libraries: let us say there is a library A; we take all of the libraries that it is
dependent upon, all of the libraries that they are dependent upon, and so on and so forth, till
we have captured the entire set.
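Computing such a transitive closure is a plain graph traversal. The sketch below uses a made-up dependency map; it does not use .Net reflection, it only shows how the set of libraries to ship would be collected once the direct dependencies are known.

```python
from collections import deque

# Hypothetical dependency map: library -> libraries it directly references.
DIRECT_DEPS = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": [],
    "E": ["A"],
}

def transitive_closure(root, direct_deps):
    # Breadth-first traversal that visits each library exactly once.
    seen = {root}
    pending = deque([root])
    while pending:
        lib = pending.popleft()
        for dep in direct_deps.get(lib, []):
            if dep not in seen:
                seen.add(dep)
                pending.append(dep)
    return seen

to_ship = transitive_closure("A", DIRECT_DEPS)
```

Library E is not included, because A does not depend on it; only what A needs, directly or indirectly, is collected.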


The EPG code and all the associated libraries are then shipped to the cluster computer for
remote execution. This is a marked departure from what Condor does in this situation. Condor
actually sets up a channel between the sandbox and the shadow, for the case where the sandbox
needs something that the remote machine does not have. The sandbox in Condor would then send a
message to the shadow saying, look, I need this. The primary reason is that these are actually
separate setups, and the notion of deferred evaluation was not there. Also, network speeds in
those days were slow.

The other thing is that the machines may use different versions of an OS, so code from here
would not run over there and vice versa. But .Net, like Java, is a virtual ISA, so it is very
portable across all kinds of machines, and the network is fast. Hence, in this case, we do not
do what Condor does. Instead, we ship the EPG code and all the associated libraries to the
remote machine; everything is sent.

A few miscellaneous items. There are extensions of the LINQ system. We have PLINQ, which runs a
sub-expression on a cluster node in parallel using multicore processors. In that extension,
users can supply annotations, similar to the annotations in parallel programming variants like
OpenMP, which specify how the iterators are supposed to run.

LINQ also interacts with SQL; we have already seen an example of how. It can directly access SQL
databases, save internal data sets in SQL tables, and also translate LINQ sub-expressions to run
directly as SQL commands. So, a certain interoperability with SQL was added.

If you look at the design, which is based on tables, at least at this level it is similar to
SQL. The differences arise later, in terms of deferred evaluation and the usage of the basic
.Net system. But at least there is a certain level of high-level compliance, which allows a
degree of interoperability even with traditional database systems.


How do we debug a massively parallel application? Well, how did we deal with this problem in
Condor? In Condor, we had the notion of a rescue DAG. If the post-processing script of a job
said that the job could not execute correctly, then we took the entire DAG and cut out the part
that executed correctly, leaving the part that did not execute correctly. This could then be
analyzed by the user. The user could figure out from this which job did not execute correctly,
that part could be patched, and then the entire DAG could run once again.

.Net, of course, had the advantage of the Visual Studio interface to debug many of these things.
And the really praiseworthy thing that DryadLINQ added is a deterministic replay model.


These are essentially functional programs, which are side effect free and which do not rely on
anything outside the environment. So, if you look at any LINQ sub expression, it is like a program
where all of the inputs are specified.

So, every time you run this program, regardless of where you run it, because .Net, like Java,
provides a degree of platform independence, if the input remains the same the output has to remain
the same. This means that if, let us say, a given task failed on a remote node, we can bring it
back to the client machine and provide it the same inputs that it got on the remote node.

We can then debug it line by line and see exactly what happened. We can pretty much replay the
entire execution event by event, view the outputs on the local machine, figure out what went
wrong, fix it, and send it back. The advantage is that there is no non determinism in the
execution, which is actually there in other frameworks like MPI and so on.
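
To make the replay idea concrete, here is a minimal Python sketch. It is not how DryadLINQ implements replay; the names word_count, run_and_record and replay are made up for this illustration. The only assumption is the one stated above: the task is side effect free, so recording its inputs is enough to reproduce its behavior on another machine.

```python
import json

def word_count(lines):
    """A side-effect-free task: the output depends only on `lines`."""
    counts = {}
    for line in lines:
        for word in line.split():
            counts[word] = counts.get(word, 0) + 1
    return counts

def run_and_record(task, inputs):
    """Run `task` (as if on a remote node) and keep a serialized copy of its inputs."""
    record = json.dumps(inputs)
    return task(inputs), record

def replay(task, record):
    """Re-run `task` locally on exactly the recorded inputs."""
    return task(json.loads(record))

# Hypothetical inputs, used only to show that the replay matches the original run.
remote_output, record = run_and_record(word_count, ["a b a", "b c"])
local_output = replay(word_count, record)
```

Because the task reads nothing except its inputs, the local run can be stepped through in a debugger and will follow the same path as the remote run.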


Non determinism would only crop up if the execution depended on something which is not really an
input, but is produced by some other task at an intermediate, halfway state. Then, depending on
the latency of the network, the execution can be non deterministic: the task may or may not get a
certain event, and the time of getting it also matters.

All of this has been avoided. This is simply a LINQ sub expression. It is given a set of inputs,
fully or partially evaluated, and for the same set of inputs the output will be the same; the
behavior will be the same on all machines that support .Net. So this takes care of the debugging
aspect very well. Along with correctness debugging, we can also have performance debugging, where
we collect detailed execution profile information.


For example, it is possible that a given job runs very slowly. To fix that, we can collect
detailed runtime statistics and use them to figure out why the job runs slowly.

Now for the execution setup. It was a 240 computer cluster. Each node contains two AMD Opteron
processors, which are server processors, and 16 gigabytes of main memory, which was considered
good 10 years ago.


The first benchmark was Terasort, where we sort a terabyte sized dataset. Of course, it is a
terabyte when we consider 240 nodes. Otherwise, if we consider n nodes, it is 3.87 gigabytes per
node. The way they did the experiment is that they considered a terabyte of data for 240 nodes,
but they also considered 100 nodes, 50 nodes, and so on. If the number of nodes is n, the data was
scaled accordingly, so it was 3.87n gigabytes.
So, first, as we discussed, Terasort and its execution time. Again, the data per node was kept
constant. With a single machine, the execution time was 120 seconds. It quickly jumped to 250
seconds, and then the execution time showed an asymptotic behavior, saturating at around 320
seconds.

And of course, at the point with 240 nodes, the data was 1 terabyte. The starting point is not 0,
it is 120 seconds. The reason for this asymptotic behavior is that as we add more nodes, there are
more overheads in terms of EPG compilation, dispatching the inputs and the code to different
nodes, network delays, switch delays, and so on.

But still, it is not that bad, given that from 1 to 240 machines the execution time increased by
just around 2.6 to 2.7 times, which is less than triple. And sorting terabyte sized data in about
5 minutes with rather simple code, with the partitioning mechanisms that we have discussed, is
genuinely praiseworthy.
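
The arithmetic behind these numbers can be checked with a few lines of Python. This is only a worked calculation using the figures quoted in the lecture (3.87 gigabytes per node, 120 seconds on one machine, about 320 seconds on 240 machines); the function names are made up for the illustration.

```python
GB_PER_NODE = 3.87  # per-node data size quoted in the lecture

def dataset_size_gb(n_nodes):
    """Total data sorted when every node holds GB_PER_NODE gigabytes."""
    return GB_PER_NODE * n_nodes

def slowdown(time_many_nodes, time_one_node):
    """How many times longer the run takes compared with a single machine."""
    return time_many_nodes / time_one_node

total_240 = dataset_size_gb(240)  # roughly 929 GB, which is close to a terabyte
factor = slowdown(320, 120)       # roughly 2.67
```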
The other was the SkyServer benchmark, which contains a lot of astronomical data. It was a 3 way
join on 2 tables. One was 11.8 gigabytes and the other was 41.8 gigabytes. The number of machines
was varied from 1 to 40, and the speedup increased with it. They compare this with the baseline
Dryad system, which does not use LINQ expressions or this mechanism, but just uses the basic
distributed framework.

Of course, the speedup was higher with Dryad: it increased from 1 to 24, sublinearly of course,
and it went down from 24 to 19 in the case of DryadLINQ. But if you think about it, it is not that
bad given all the automatic things that DryadLINQ does, and how much easier it makes programming
for us.


The paper that describes DryadLINQ was published in OSDI 2008, around 12 years ago. For those
days, these were fantastic results. Now, of course, because of improvements in hardware, the
results have become better, no doubt. But the important point is that for leveraging a cluster of
heterogeneous machines, we need systems like Condor, which saw its heyday from 1990 to, let us
say, 2005.

And then, of course, the DryadLINQ kind of system. And now we have many more systems. Condor did
spawn a large number of other projects like LSF and LoadLeveler, which are still heavily used in
companies. DryadLINQ has also spawned a large number of similar projects, where we are talking of
really expressive mechanisms for describing a parallel execution.
The underlying framework does a lot for us, and that too automatically. This is a continuing area
of research because, along with regular processors, we have GPUs, accelerators, and so on. For
GPUs, of course, we have OpenCL, which is not exactly distributed, but it is in that direction.

So, we have many such frameworks and will continue to have them; this area is clearly not going to
die. As the diversity of computing devices increases, we will have more and more of such
frameworks, which do more of the planning and scheduling automatically.
Welcome to the lecture on the CAP theorem. The CAP theorem is a very important theorem in the
design of distributed systems and distributed databases. In this lecture, we will also discuss
many other aspects related to the C of the CAP theorem, which is consistency. So, think of this
as a lecture where we discuss issues related to consistency and issues related to the CAP theorem.
The basic idea of the CAP theorem is like this. This theorem and an extension were initially put
forth to the community as a conjecture, as part of a talk, and were proved much later, in some
cases by the same person and in some cases by others. At the PODC conference, which is one of the
top conferences in distributed systems, Eric Brewer made the following conjecture: you cannot
design a protocol or a web service that simultaneously guarantees the following three properties:
consistency, availability, and partition tolerance.

We will discuss what these are. But essentially, these are desirable properties of any distributed
system or any database. You want the data to be consistent. Consistent with what? Consistent with
specifications. What kind of specifications? Well, we will see. Availability means the system
should be available, in the sense that every request gets a response. And partition tolerance
means that even if the network splits into two separate partitions, the system is still
operational. He said you cannot guarantee all three together. This may sound obvious, but it was
not, and it still is not if you apply it rigorously.
* Isolated > Uncommitted transactions are isolated from each other
(cannot see each other's updates)


* Durable > Once committed, a transaction's changes are permanent

So, how do we understand them? We need to understand that all of these theorems basically came
out of the database literature. In the database literature, we have the classic ACID semantics,
which need to be discussed first, before we get into the CAP theorem. The first property is
atomicity. The idea is that either a single operation or a bunch of operations, known as a
transaction, successfully completes in its entirety, which is called a commit event, or it fails
in its entirety, so that it appears as if the entire sequence of operations had never started in
the first place.
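
As a rough illustration of this all-or-nothing behavior, here is a minimal Python sketch using the standard sqlite3 module. The accounts table, the account names and the transfer function are assumptions made up for this example; they are not part of the lecture. The sketch relies on the documented behavior that using a sqlite3 connection as a context manager commits on success and rolls back if an exception escapes.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction (all or nothing)."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)).fetchone()
        if balance < 0:
            raise ValueError("insufficient funds")
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))

def balances(conn):
    """Return the current balances as a dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("a", 100), ("b", 0)])
conn.commit()

transfer(conn, "a", "b", 30)       # commits: both updates become visible
after_commit = balances(conn)

try:
    transfer(conn, "a", "b", 500)  # fails halfway: the first update is undone
except ValueError:
    pass
after_abort = balances(conn)
```

After the failed transfer, the balances are the same as before it started, as if the sequence of operations had never begun.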


The next is consistency. As I have discussed, consistency means that the result of an execution is
consistent with specifications. Of course, the specifications may themselves vary, but there are
still some commonalities across different specifications, which we will discuss. Isolated means
that transactions which are not committed appear to be isolated from each other. In the sense
that if, let us say, two database transactions are happening at the same time, then they appear to
happen in isolation, which means one cannot see the updates made by the other.

The final one is durable. Durable means that once the transaction is committed, so once we are
done with an operation, the changes are permanent. They are written to permanent storage.
Distributed system services clearly need a different model; the ACID model will not work for them.
So, in this case, we adopt a different model, which I will describe next.
So, let us discuss consistency. The dictionary meaning, as I said, is that the outcomes of an
execution should satisfy some written specifications or some intuitive notion of correctness. We
can break it down into several components. The first component is atomicity, which means that all
operations appear to take place instantaneously. Given an operation or a bunch of operations, it
appears to take place instantaneously, in the sense that you cannot see any intermediate state.


So, let us first define some terminology and then I will explain. Let Wx1 mean write 1 to the
global variable x. Let Rx1 mean that I read the value of the global variable x to be 1. Let P1 and
P2 be processes, and let x and y be global variables initialized to 0. So, what we have over here
is two processes, P1 and P2, executing in a distributed system. Of course, whether you have shared
memory or not does not matter.

But in this case, let us assume we have shared memory, in the sense that they are operating on
the same objects. The objects are read-write objects named x and y, both initialized to 0. So, let
us look at this execution: one process writes 1 to x and then reads y = 1, and the other writes 1
to y and then reads x = 1. So, this is the execution.
So, what we want to do now is see why this example has atomic writes. That is because we can lay
the operations out in a sequence, where we first have Wx1, then Wy1, then we read y to be 1, and
then we read x to be 1: Wx1, Wy1, Ry1 and Rx1. So, as you can see, we can lay them out.

It appears that the first write happened at one instant, then the next write happened at one
instant, and then each read again happened at one instant. Reads in general are atomic in all
kinds of systems; writes may not be. You wrote 1 to y and you are reading y = 1; you wrote 1 to x
and you are reading x = 1. So, you can lay them out. Furthermore, the sequence is legal, which
means that every read fetches the value of the latest write to the same address.

Ry1 returns the value written by Wy1, and Rx1 returns the value written by Wx1. So, this is
clearly a sequence with atomic events, and this sequence is also legal. What else is special about
this sequence?
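
The legality condition can be written down in a few lines of Python. This is a minimal sketch; the tuple encoding ("W" or "R", variable, value) and the function name is_legal are assumptions chosen for the illustration, with all variables initialized to 0 as in the lecture.

```python
def is_legal(sequence, initial=0):
    """Return True if every read returns the latest write to the same variable.

    Each operation is a tuple (kind, variable, value), where kind is "W" or "R".
    """
    memory = {}
    for kind, var, value in sequence:
        if kind == "W":
            memory[var] = value
        elif memory.get(var, initial) != value:
            return False
    return True

# The sequence from the lecture: Wx1, Wy1, Ry1, Rx1.
order = [("W", "x", 1), ("W", "y", 1), ("R", "y", 1), ("R", "x", 1)]
# A sequence that is not legal: y is read as 1 before anything wrote 1 to it.
bad_order = [("R", "y", 1), ("W", "y", 1)]
```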


(Refer Slide Time: 07:21) 


Sequential Consistency (SC)


[Slide figure: operations of P1 and P2 and the equivalent global order]


* Note the order of operations within each thread (program order)


* They have the same relative order in the equivalent global order


Sequential consistency


We can reduce a parallel multi-process execution to a sequential execution,
where each operation is atomic, the sequence is legal, and the intra-thread
(program) order is respected. The operations are basically interleaved to
get the equivalent global order.


So, let us look at this sequence once again; let us not look at the title of the slide yet. We can
lay the operations out in an equivalent order. Of course, these are two separate processes, but we
can lay them out in a sequential order where it appears that some master execution engine is
choosing one operation from here, then the next, then the next, and so on. So, you have Wx1, Wy1,
Ry1 and Rx1.

Note the order of operations within each thread. Within the first thread, this is the first
instruction and this is the second, so clearly the intra-thread order is Wx1 first and Ry1 after
that, and we get to see that. Similarly, we see Wy1 first and Rx1 after that. So, what we see is
that if we create an equivalent global order, number one, it is atomic, because it appears as if
each entire operation executed at one instant of time.

Number two, it is legal, and number three, the order of operations within each thread is the same
as their order in the equivalent global order. The order within a thread is also known as the
program order. So, this particular execution, or rather its outcome, is such that we can reduce a
parallel multi-process execution to a sequential one.


In this case, we have a parallel multi-process execution because we have two processes, P1 and P2,
and we are reducing it to a sequential execution where three things hold: each operation is
atomic, the sequence is legal, and the intra-thread order, or program order, is respected. All
three things hold at the same time. So, it appears as if you take the operations of the different
processes and interleave them, the same way you can interleave your fingers. By interleaving the
operations differently, you will get different global orders.

But the idea is that the execution is consistent with at least one global order where three
things hold: atomicity, legality, and respect for program order. Any such execution is known as a
sequentially consistent execution. You can clearly see where the name comes from: a parallel
execution is being equated with a sequential execution. So, you say that the execution is
sequentially consistent.
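
The definition can be turned into a brute-force check: try every interleaving that respects program order and see whether at least one of them is legal. The following Python sketch does exactly that. It is only meant for tiny examples, since the number of interleavings grows very quickly, and the names is_legal, interleavings and sequentially_consistent are assumptions made up for this illustration.

```python
def is_legal(sequence, initial=0):
    """True if every read returns the latest write to the same variable."""
    memory = {}
    for kind, var, value in sequence:
        if kind == "W":
            memory[var] = value
        elif memory.get(var, initial) != value:
            return False
    return True

def interleavings(threads):
    """Yield every merge of the threads that keeps each thread's program order."""
    if all(len(t) == 0 for t in threads):
        yield []
        return
    for i, t in enumerate(threads):
        if t:
            # Take the next operation of thread i, then interleave the rest.
            rest = threads[:i] + [t[1:]] + threads[i + 1:]
            for tail in interleavings(rest):
                yield [t[0]] + tail

def sequentially_consistent(threads):
    """True if at least one program-order-respecting interleaving is legal."""
    return any(is_legal(seq) for seq in interleavings(threads))

# The execution from the lecture: one thread does Wx1, Ry1; the other does Wy1, Rx1.
lecture_example = [[("W", "x", 1), ("R", "y", 1)],
                   [("W", "y", 1), ("R", "x", 1)]]
# A standard counter-example (not from the lecture): both reads return 0.
both_read_zero = [[("W", "x", 1), ("R", "y", 0)],
                  [("W", "y", 1), ("R", "x", 0)]]
```

For the lecture's execution the interleaving Wx1, Wy1, Ry1, Rx1 is found; for the second example no legal interleaving exists, so it is not sequentially consistent.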


SC is by far the most popular consistency model in, I would say, all of distributed systems, even
though enforcing SC itself can often be quite impractical. It was proposed by Leslie Lamport way
back in the seventies, and since then it has been regarded as a gold standard of consistency in
distributed systems of all kinds.
So, can we have a non-atomic execution? Well, we can. Let us assume we have three processes, P1,
P2 and P3, and P1 writes 1 to x. P2 reads x and sees that the value is 1. This is fine; you can see
that there has been a data transfer. Then assume that you respect program order, in the sense that
you finish this operation and then you do the next.

So, after reading x = 1, P2 writes the value 1 to y. Then P3 reads y = 1 and finally reads x. What
do you expect? Since there is causality, a chain of dependences, over here, you expect that if 1
has been written to x, P3 will also read x to be 1. So, you would expect to see Rx1 over here. But
let us say you do not see it. What does this mean? It means that the write to x has gone to P2
first and reaches P3 much later: at this particular point it has not reached P3, but it has
reached P2.

So, if the program order is respected, we expect P3 to read x equal to 1, but in this case it has
read x equal to 0. This essentially means that the writes are not atomic, in the sense that the
write Wx1 has not appeared to happen at a single instant of time. Instead, it appears to happen
at different instants of time for different processes. For P2 it appears to have occurred before
Rx1; for P3 it appears to happen after Rx0.

So, in this case, the writes are not atomic, and in many distributed systems writes are not
atomic. That is mainly because, to make a write atomic, you have to broadcast the write to
everybody and ensure that a process can only read a write when the update has been propagated to
the rest of the processes. This is the atomic write problem. Of course, there are algorithms to
ensure atomic writes, but a naive implementation would lead to a non-atomic execution. In other
words, in this case, you can see an intermediate state where the write has gone to one process
but not to another.
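
The scenario can be imitated with a toy Python simulation in which each process keeps its own local copy of the variables and updates are delivered to the copies at different times. This is only a sketch under that assumption; the Replica class and the delivery order are made up to reproduce the outcome described above, and no real network is involved.

```python
class Replica:
    """Local copy of the shared variables held by one process."""

    def __init__(self):
        self.values = {}

    def read(self, var):
        return self.values.get(var, 0)  # variables are initialized to 0

    def apply(self, var, value):
        self.values[var] = value

p2, p3 = Replica(), Replica()

# P1 performs Wx1. The update reaches P2 right away, but P3 only later.
p2.apply("x", 1)
x_seen_by_p2 = p2.read("x")   # P2: Rx1

# P2 performs Wy1. This update happens to reach P3 quickly.
p2.apply("y", 1)
p3.apply("y", 1)
y_seen_by_p3 = p3.read("y")   # P3: Ry1
x_seen_by_p3 = p3.read("x")   # P3: Rx0, because Wx1 has not arrived yet

# The delayed Wx1 finally reaches P3.
p3.apply("x", 1)
x_seen_later = p3.read("x")
```

P3 observes the effect of Wy1 without the earlier Wx1, which is exactly the intermediate state that an atomic write would hide.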
So, what is the problem with sequential consistency? If we come back to the same example and the
equivalent global order, the idea is that you only talk about relative orders, in the sense that
you say: first this operation happened, then this, then this, and then this. Fair enough; this you
are inferring from the outcome. But the point is that you are totally ignoring the real time, the
absolute time at which these operations happen. It is possible that in terms of absolute time,
operation one happened first, operation two happened after that, and then operation three
happened after that.

What would happen in this case? Operation two will not read y to be 1; it will read y to be 0.
Then operation three happens and then four, so the rest are fine. So, the issue arises if this
was the real order in which the operations happened, where, let us say, each operation has a
start time and an end time.

It is very well possible that after operation one ended operation two began, after operation two
ended operation three began, and so on and so forth. In this case, what you will see is that even
though this execution is sequentially consistent, it is not really consistent with the wall clock
order, because here we are only talking of relative orders, first this operation and then this; it
has nothing to do with real time.

But the moment you bring in real time, and if this was indeed the real time order, where operation
one stopped, operation two began, operation two stopped, operation three began, and so on and so
forth, you will see that the outcome will change and become Ry0. So, in this case, sequential
consistency allows one answer, and adhering to the real time order gives you a different answer.
So, whenever we are fine with just a relative order, we will accept any outcome that sequential
consistency allows. But SC is more of a theoretical model: it preserves relative orders, it
preserves read-write relationships, and it preserves atomicity, but it ignores real-time values
and constraints. If we add constraints to SC, we come up with another consistency criterion. This
consistency criterion is called linearizability.


So, linearizability is stronger than SC. The reason it has been added is that there were issues
with SC, which were pointed out in the previous slide. The two additional caveats that we add are
as follows. Every operation, of course, has a start time and an end time. Atomicity says that an
operation needs to appear as if it executes at a single instant of time. But this single instant
could be any time; for instance, it could be after the operation has ended.

In that case, the results of the execution will not be consistent with the real-time order of
operations, so linearizability does not allow this. It says that the instant at which an operation
appears to complete, and it still has to appear to complete instantaneously, has to be between
its start and its end. That is the first point. The second is this: assume operation B starts
after operation A ends. Then, in the equivalent global sequential order that we create, as we did
in the case of SC (sequential consistency), B needs to appear after A.


Why is that the case? Because clearly these two are not concurrent operations. That is the reason
A appears first and B needs to appear after A. Of course, if the executions are concurrent, then
linearizability does not say who needs to appear after whom, because there is an overlap.

But again, since they appear to execute instantaneously, all threads should see either A appearing
before B or B appearing before A in the equivalent sequential order. So, these are the two extra
conditions that are added in the case of linearizability, which SC did not have; they bring it
more in line with real-time constraints.
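
These two conditions can also be checked by brute force for tiny executions. In the following Python sketch each operation carries a start and an end time, and an execution is accepted if some ordering is legal and never places B before A when B started after A ended. The tuple layout (kind, variable, value, start, end), the timestamps and the function names are assumptions chosen for the illustration. The sketch also assumes that the operations of a single process do not overlap in time, so that program order is implied by the real-time condition.

```python
from itertools import permutations

def is_legal(sequence, initial=0):
    """True if every read returns the latest write to the same variable."""
    memory = {}
    for kind, var, value, _start, _end in sequence:
        if kind == "W":
            memory[var] = value
        elif memory.get(var, initial) != value:
            return False
    return True

def respects_real_time(sequence):
    """If B starts after A ends, then A must come before B in the sequence."""
    for i, earlier in enumerate(sequence):
        for later in sequence[i + 1:]:
            if later[4] < earlier[3]:  # `later` ended before `earlier` started
                return False
    return True

def linearizable(ops):
    """Brute force over all orderings; only practical for a handful of operations."""
    return any(is_legal(p) and respects_real_time(p) for p in permutations(ops))

# Operations strictly one after another in real time: Wx1, Ry?, Wy1, Rx1.
reads_y1 = [("W", "x", 1, 0, 1), ("R", "y", 1, 2, 3),
            ("W", "y", 1, 4, 5), ("R", "x", 1, 6, 7)]
reads_y0 = [("W", "x", 1, 0, 1), ("R", "y", 0, 2, 3),
            ("W", "y", 1, 4, 5), ("R", "x", 1, 6, 7)]
# Here the read of y overlaps the write to y, so either order is allowed.
overlapping = [("W", "x", 1, 0, 1), ("R", "y", 1, 2, 6),
               ("W", "y", 1, 3, 4), ("R", "x", 1, 7, 8)]
```

With these made-up timestamps, the outcome Ry1 is rejected when the read finishes before the write to y starts, while Ry0 is accepted; when the two operations overlap, Ry1 becomes acceptable again.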
So, that is why we saw a little bit of odd behavior over here: given the real time order, we
expected Ry0, but the execution result that we got was Ry1. This is fine from an SC standpoint,
but not from a linearizability standpoint. Whereas if you take a look at the second execution, it
is linearizable and it is also in SC; the first one is in SC but it is not linearizable. So,
linearizable executions are essentially a subset of SC.

Basically, SC is a superset of linearizable executions: anything which is linearizable is in SC,
but not the other way around, as we just saw. The first execution is in SC but not linearizable,
while the second is linearizable as well as in SC. Why? Because we can create a sequential order
like this: Wx1, then Ry0, then Wy1 and then Rx1.

As you can see, this order is legal, it respects program order, and it is atomic. So, we have
discussed atomicity, how atomicity is combined with legality and program order to create SC, and
how we add real time ordering to create linearizability. We can now move forward.
So, let me nonetheless summarize with the little bit of extra space that I have over here. The
properties we have are atomicity, program order, and legality, where legality means every read
needs to return the value of the latest write. These were the properties of SC. If we add real
time order to these, then what we get is linearizability. We have discussed all of that, but for
the purpose of discussing the CAP theorem, the only consistency guarantee that we will use is
atomicity.
And we will show that atomicity itself is sufficient to prove almost everything. But since we have
the opportunity, let us discuss other types of consistency as well. You can have causal
consistency. Let us say there is a write, then a read of that write, and after that another write,
with a program order relationship between the read and the second write. Then everybody should see
the first write happening first and the second write happening later. It should never be the case
that you do a read, see the result of the second write, but do not see the result of the first.

That is because the two operations are related by a cause effect relationship; it is causal. You
wrote and I read what you wrote, so we have a cause effect relationship. I executed, and that is
the reason you executed; again, we have a cause effect relationship. As long as there is some sort
of cause effect or causal relationship, we expect the operations to be seen in the same order by
all processes.

But if operations do not have a causal relationship, they can be seen in any order. A nice example
can illustrate the point. Let us say we write x = 1 and we write x = 2. One thread reads x and sees
first 1 and then 2, which is fine. As far as that thread is concerned, if atomicity were to hold,
then x was written first with 1 and then with 2. But what if another thread observes Rx2 and then
Rx1? Clearly, in this case, atomicity for x does not hold.

It appears that we are updating x in a different order. The two threads do not agree on the order
in which x was updated, and there is no causal relationship between the two writes. So, causal
consistency basically says you can see them in any order, which is what the threads are doing.
This is clearly not sequentially consistent and clearly not linearizable, but it is causally
consistent.
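
A small Python sketch can express the rule that only causally related writes need to be observed in a fixed order. The function respects_causality and the string labels for the writes are assumptions made up for this illustration. The causal relationships are supplied explicitly as pairs (a, b), meaning that a causally precedes b; how a real system tracks them is outside the scope of this sketch.

```python
def respects_causality(observed, causal_pairs):
    """True if the observed order of writes never contradicts a causal pair.

    `observed` is the order in which one process saw the writes.
    For every pair (a, b) in `causal_pairs`, seeing b requires having seen a earlier.
    """
    position = {write: i for i, write in enumerate(observed)}
    for a, b in causal_pairs:
        if b in position and (a not in position or position[a] > position[b]):
            return False
    return True

# The two writes to x from the lecture are not causally related,
# so different observers may see them in different orders.
no_pairs = []
thread_one = ["Wx1", "Wx2"]
thread_two = ["Wx2", "Wx1"]

# In the earlier three-process example, Wx1 causally precedes Wy1.
pairs = [("Wx1", "Wy1")]
```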


So, causal consistency is weaker: the same way that SC is weaker than linearizability, causal
consistency is weaker than sequential consistency. We can also define consistency in a different
way. Let us say that we have multiple servers maintaining many replicas of a variable x, which you
want to read or write. The replicas themselves could be loosely synchronized, so if you read a
value from the set of servers, you may get a replica which is stale, in the sense that its value
is old and inaccurate.

But under certain conditions, let us say in this continuous consistency model, it is guaranteed
that the error between the value of x that you get and the real value of x, which you should have
gotten in ideal circumstances, is bounded. Again, this is not a very generic model; it applies
only in some cases, where you are dealing with things like clocks or counters. But it is a
consistency model where you look at the difference in the values and try to place a bound on it.
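
One simple way to picture a bounded error is a counter replica that is refreshed only when it would otherwise drift too far from the true value. The following Python sketch is one possible illustration of that idea, not a description of any particular system; the class BoundedCounterReplica and the bound of 3 are assumptions made up for the example.

```python
class BoundedCounterReplica:
    """A stale copy of a counter that is refreshed before its error exceeds `bound`."""

    def __init__(self, bound):
        self.bound = bound
        self.cached = 0

    def on_primary_update(self, true_value):
        # Propagate only when the replica would otherwise drift past the bound.
        if abs(true_value - self.cached) > self.bound:
            self.cached = true_value

    def read(self):
        return self.cached

replica = BoundedCounterReplica(bound=3)
max_error = 0
for true_value in range(1, 11):  # the real counter goes 1, 2, ..., 10
    replica.on_primary_update(true_value)
    max_error = max(max_error, abs(true_value - replica.read()))
```

A reader of the replica may get an old value, but it is never more than 3 away from the real one.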
We will now discuss a few more kinds of consistency models from the point of view of a client. So,
who is a client? Assume that you are accessing Gmail. Gmail is the server and my mobile phone is
the client. Or you are accessing an office network that has lots and lots of documents, using a
mobile phone, laptop or desktop; in that case, the mobile phone or laptop is the client. Up till
now, we were looking at consistency from the perspective of the system. But from the perspective
of the client, we could have many consistency models, so let me discuss a few.


Let us look at monotonic reads. If a certain value for a variable was read, let us say you read
an email, then subsequent reads by the same process, which means the same mobile phone, whether
it connects to the same server or a different server, will yield the same value or a more recent
one. This basically means that if I connect to Gmail and I am able to see an email, and half an
hour later I connect to Gmail again, but internally Gmail connects me to a different internal
server, I should still see that mail; it should not go away.

If this does not hold, I may see a mail now, and later, when I connect to another server, I may
find the mail to be missing, so there will be an issue. These are monotonic reads, which are
purely from the perspective of the client; we are not looking at it from everyone's perspective,
as we were doing earlier.
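
One common way to get monotonic reads is for the client to remember how recent its last read was and to refuse any replica that is older. The following Python sketch assumes, purely for illustration, that every server replica carries a version number; the Server and Client classes are made up for the example and say nothing about how Gmail actually works.

```python
class Server:
    """A replica of the mailbox, tagged with how up to date it is."""

    def __init__(self, version, mails):
        self.version = version
        self.mails = mails

class Client:
    """Remembers the newest version it has seen and refuses older replicas."""

    def __init__(self):
        self.last_seen = 0

    def read(self, server):
        if server.version < self.last_seen:
            raise RuntimeError("replica is staler than an earlier read")
        self.last_seen = server.version
        return server.mails

fresh = Server(version=2, mails=["hello", "invoice"])
stale = Server(version=1, mails=["hello"])

client = Client()
first_read = client.read(fresh)
try:
    client.read(stale)   # would make the "invoice" mail disappear
    rejected = False
except RuntimeError:
    rejected = True
```

In a real system the client would retry on another server or wait for the stale one to catch up instead of simply failing.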


Next are monotonic writes, which means write operations made by the same process happen in order.
What would that be? On Twitter, you would have seen many of these tweets marked 1/2, 2/2, and so
on, which are the first part and the second part of a tweet. I post these from my Twitter client,
but what if Twitter shows the second part of the tweet first and the first part after it? Along
with being incorrect, it can lead to quite embarrassing situations. This is definitely not what
you expect, and that is why this consistency model from the point of view of a client is called
monotonic writes.

Clearly, you can see that if atomicity holds, monotonic reads are not a problem, because once you
have read a value, you will always read it. Monotonic writes are not a problem either, because
once you have written, the write appears to have happened at an instant, so the moment the first
tweet is done, you can do the second and it will remain that way.
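
A simple way to enforce monotonic writes is for the client to number its writes and for each replica to apply them strictly in that order, buffering anything that arrives early. The following Python sketch shows this for a single client; the Replica class and the sequence numbers are assumptions made up for the illustration.

```python
class Replica:
    """Applies one client's writes strictly in sequence-number order."""

    def __init__(self):
        self.next_seq = 1
        self.pending = {}    # writes that arrived before their predecessors
        self.timeline = []   # writes that have been applied, in order

    def receive(self, seq, text):
        self.pending[seq] = text
        # Apply every write whose predecessors have all been applied.
        while self.next_seq in self.pending:
            self.timeline.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

replica = Replica()
replica.receive(2, "2/2 second part")   # arrives early, so it is buffered
after_first_arrival = list(replica.timeline)
replica.receive(1, "1/2 first part")    # now both can be applied in order
```

Even though the second part arrived first, it only becomes visible after the first part.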
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So, I can extend this to have two more models read your writes and writes follow reads. So, read 
your writes is like this, that if a write operation is done on a process by a process and item x, any 
subsequent read operation by the same process will read will at least fetch the write that in its own 
write, or something newer of course, so, why am I saying that? Well, the reason I am saying that 
is let us say you compose a mail and you store it in drafts and later on, you read the drafts folder 
you are supposed to see the mail. If you do not see the mail, you will sort of see the mail vanish 


which again is something that atomicity will not allow. 
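
One way to picture these client-centric guarantees is a session object that remembers the highest
version it has read or written and refuses anything older from a replica. The sketch below is only
illustrative: the `Replica` and `Session` classes and the integer version numbers are assumptions
made for this example, not something taken from the lecture.

```python
class Replica:
    """A server holding one versioned copy of item x (hypothetical)."""
    def __init__(self):
        self.version = 0
        self.value = None

    def apply(self, version, value):
        # Keep only the newest version seen so far.
        if version > self.version:
            self.version, self.value = version, value


class Session:
    """Client session that enforces monotonic reads and read-your-writes."""
    def __init__(self):
        self.min_version = 0  # highest version this client has read or written

    def write(self, replica, value):
        version = max(self.min_version, replica.version) + 1
        replica.apply(version, value)
        self.min_version = version

    def read(self, replica):
        if replica.version < self.min_version:
            # The replica is behind what this client has already seen.
            raise RuntimeError("replica is stale for this session")
        self.min_version = replica.version
        return replica.value


s1, s2 = Replica(), Replica()
client = Session()
client.write(s1, "draft mail")      # the write reaches only s1
ok = client.read(s1)                # same server: the draft is visible
try:
    client.read(s2)                 # s2 never received the update
    stale_detected = False
except RuntimeError:
    stale_detected = True
```

Here the read from the second server is rejected instead of silently showing an empty drafts
folder; a real system would more likely redirect the read to an up-to-date replica.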


At a fundamental level, if atomicity holds, you will never see that. But if, let us say, you have
different email servers, and one email server holds the drafts folder and has not broadcast the
updates to the others, this indeed can happen. Then we have writes follow reads: a write operation
that overwrites a variable will overwrite, fully or partially, the version that was read. So, let
us say I read a document, or I read a draft email, and then I make a change. So, what am I reading?
The object that I am reading is a draft, and then maybe I make a change to it; I change a sentence.


So, I am doing a write. What should happen is that, if I have read something, the write operation
will only overwrite the draft that I have read; it should not land on some other version of the
draft. For example, I could have saved the same draft twice, a smaller version and a bigger
version. Of course, the bigger version is the more recent one, so that is the one that should
stay. So, the next time I open my drafts folder, I should not see the smaller version; I should see
the bigger one, and that is where I make a change.


But what happens if, while I am writing, the write goes to the smaller one and not the bigger one?
Again, it can happen if I have multiple servers and these versions, or replicas, of the draft mail
are there in different servers. That could cause an issue, but in a good distributed system it
should not happen. Of course, if you have atomicity, this condition will always hold.


Again, one more example: you have reactions to tweets, where a reaction is a write, but that should
only happen after the original tweet has been read. So, it should not be the case that you read a
tweet, you react to it, and then you see that the reaction is there but the original tweet is gone.
That, again, should not happen, and atomicity would guarantee that it does not happen.


So, what is the crux of our idea? The crux of the idea is like this. We looked at several
consistency models: SC, linearizable and causally consistent. We also discussed continuous
consistency, where we looked at errors, but ideally, the error should be zero. Then we discussed
different client-centric models.


So clearly, in the first two, atomicity was common. With causal consistency we did not have
atomicity, and that is why we did not like the results that it gave. With continuous consistency,
again, if you have atomicity you do not have an error, so atomicity is definitely desirable. And
with the client-centric models, if you have atomicity, then you will not have any problem. So, that
is why in the CAP theorem they have taken atomicity as the definition of consistency, by and large.
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So then, after we have discussed the consistency models, we should cap this discussion with a 
discussion on quorums. So what is a quorum? A quorum in a meeting is basically the minimum number
of people who need to be present in the meeting to make a decision. So, we can do the same if we
have a bunch of N servers: we can have a write quorum of N_W servers. The write quorum is more
than half of the servers, so you will know that at least a majority have the latest update.


So, number one, that insulates you against failures. We will link this with a read quorum, where a
read quorum is a set of N_R servers from which we choose the most recent value. To guarantee that
the most recent value is read, the only condition that needs to hold is N_R + N_W > N, where N_W is
the size of the write quorum and N is the total number of servers. This ensures that the read
quorum and the write quorum have an overlap, and the overlap will provide the latest value of the
write.


So, of course, the implicit assumption in the definition of a quorum and these quorum-based writes
is that when you write, you actually broadcast the write to the entire write quorum and all of them
apply it; when you read, you actually read from the entire read quorum. Every write is associated
with a timestamp, where the higher the timestamp, the more recent the write is, and the value that
is ultimately returned is the value with the latest timestamp. So, this does increase the overhead
within the server farm, because a write now has to be broadcast, or multicast, to the quorum.


And a read is also required from everybody in the read quorum, such that you can get the value of
the latest write. But this is the price to pay if we are using a set of servers for both
performance and redundancy. And of course, if we have more servers, it will make the data more
available, but ensuring consistency becomes difficult. So, you can clearly see a consistency and
availability trade-off over here.
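
The quorum idea can be tried out with a small simulation. This is a minimal sketch under assumed
parameters: N = 5 servers with read and write quorums of size N_R = N_W = 3, integer timestamps,
and quorums picked at random. The function names are made up for this illustration.

```python
import random

N, N_W, N_R = 5, 3, 3          # assumed sizes; N_R + N_W > N must hold
assert N_R + N_W > N

# Each server stores a (timestamp, value) pair.
servers = [(0, None) for _ in range(N)]

def quorum_write(timestamp, value):
    """Send the write to a randomly chosen write quorum of N_W servers."""
    for i in random.sample(range(N), N_W):
        if timestamp > servers[i][0]:
            servers[i] = (timestamp, value)

def quorum_read():
    """Read from N_R servers and return the value with the latest timestamp."""
    replies = [servers[i] for i in random.sample(range(N), N_R)]
    return max(replies, key=lambda reply: reply[0])[1]

quorum_write(1, "v1")
quorum_write(2, "v2")
latest = quorum_read()
```

Because any 3 of the 5 servers must share at least one server with the 3 that received the last
write, every read quorum contains a copy carrying the highest timestamp, so the read returns the
latest value regardless of which servers were picked.
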
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So, the CAP theorem is all about trade-offs, so, but before that, what is the final word on 
consistency here? It is that sequential consistency, other similar variants and the client-centric
models all rely on atomicity. So, we will also primarily focus on atomicity as the C, that is,
consistency.


The remaining two are easy. The first is availability, which requires that every request received
by a correct node must result in a response, which means that requests must terminate. So,
termination is guaranteed in the sense that every request has to result in a response. The second
is partition tolerance, where a partition means that all messages sent from nodes in one partition
to nodes in another partition are lost. And in spite of that, the system needs to work as well, as
nicely and as correctly as possible.


So, what is the theorem? The key theorem is that we cannot guarantee both availability and atomic
consistency for a read-write object in any asynchronous setting where messages can be lost or a
partition can be created. Okay, why not? So, assume that messages are lost and a partition is
created; we call one partition G1 and the other G2. So, all messages between G1 and G2 are lost.


So, if a write occurs in G1 and later, for the same variable, there is a read in G2, clearly a
stale value will be returned. Because you are guaranteeing availability, you do not have the option
of sitting on the request; you have to return whatever you have, and what you have will be a stale
value, because you never got a chance to get the most up-to-date value. So, this would clearly
violate atomicity, because it does not appear that the write occurred at an instant: the write
occurred and, much later, you made a read from G2 and what you got was a stale value.


So, the issue is that in this case, you cannot guarantee both availability and atomic consistency,
the reason being that it is not possible to propagate the write to G2, and that is why the write
will not appear to complete in an instant. And given the fact that nodes in G2 are bound to return
a value by the availability requirement, they have to return a stale value, which is going to break
atomicity. So, what we can see is that ensuring CAP, that is, consistency, availability and
partition tolerance, is not possible in an asynchronous setting where messages can be lost.


Let us have a corollary. We again cannot guarantee both availability and atomic consistency for a
read-write object in an asynchronous setting even when no messages are lost. So, this is much
stronger: it says that even if no messages are lost, we still cannot guarantee both of them. The
argument is the same as in the FLP theorem. Given the fact that we have an asynchronous setting, a
node does not know if a message is lost or the sender is just slow. So, this was the key aspect of
the FLP result.


So, if you go back to that, we argued that in an asynchronous setting, you simply do not know if
the node is just slow in replying, or if a message is genuinely lost, or the node has developed a
fault and is dead; given only a pause in communication, it is not possible to find that out. This
situation is no different from the earlier situation where messages are lost, because you cannot
find out anyway whether messages are lost or nothing is lost. Insofar as the perspective of the
node is concerned, this situation is exactly the same as the previous one, and in the previous one
we could not guarantee availability and atomic consistency, so we cannot guarantee them over here
as well. Here is another argument: assume the earlier case where messages can be lost.


No problem. Do one thing: do not lose the messages that you intended to lose. We are looking at a
master oracle which has control over the network; it just keeps these messages in cold storage and
does not deliver them. So, nodes will perceive that these messages are lost, but they are actually
there in cold storage. At some point, from the earlier proof, we will see a non-atomic execution;
for some example, we will see a break in atomicity. At that point, release all the messages in cold
storage.


So, as far as we are concerned, at this particular point, when all the messages have been released
and also delivered, no messages are lost, because the cold store is empty and every message is
accounted for, but we have still seen a non-atomic execution. So, this means that even if there is
no message loss, we can still construct an example of a non-atomic execution.


Therefore, if availability is guaranteed, atomicity has to be compromised. So, there is a trade-off
between them, or rather, both cannot be guaranteed. This essentially means that in an asynchronous
setting, regardless of the reliability of the network, so whether or not you have partitions,
availability and atomic consistency cannot both be guaranteed; or, if we state this argument in
terms of the CAP theorem, all three cannot be guaranteed.


Can we guarantee two out of three? Well, we very easily can.


Atomic and partition tolerant. Well, have a central node and ensure that all the updates are there
with the central node, so that it maintains the up-to-date state. If you can reach the central
node, return a response, which is guaranteed to be correct in the sense that it will be atomic. If
you cannot reach the central node from a partition, well, do not return anything; given the fact
that availability is not required, we are fine. So, basically, if the option of not returning
anything is there, then of course correctness will always be guaranteed, because we will never
return something which is false, which is wrong.


Atomic and available. Well, same idea: have a centralized node for a single partition. That will
solve the problem; you will be both atomic and available. Do not care about the rest of the
partitions, because we do not have to be partition tolerant. So, other unreachable partitions can
be ignored.


Available and partition tolerant. Well, in this case there is no consistency requirement at all.
So, for that matter, any value can be provided, any stale value can be provided; given the fact
that atomicity is not being considered here, there is no problem.
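
The choice a node faces during a partition can be written down as a toy model. The sketch below is
an illustration only: the `Node` class, the partition labels and the two policy names "CP" (stay
atomic, give up availability) and "AP" (stay available, give up atomicity) are assumptions made for
this example.

```python
class Node:
    """A replica of one read-write object (hypothetical model)."""
    def __init__(self, partition):
        self.partition = partition
        self.value = "old"

def write(nodes, origin, value):
    # The write reaches only the nodes in the writer's partition;
    # messages to the other partition are lost.
    for node in nodes:
        if node.partition == origin.partition:
            node.value = value

def read(node, writer_partition, policy):
    # A node can vouch for its value only if it can reach the writer's side.
    if node.partition == writer_partition:
        return node.value
    if policy == "CP":
        return None          # refuse to answer: consistent, but not available
    return node.value        # "AP": answer anyway, possibly with a stale value

g1 = [Node("G1"), Node("G1")]
g2 = [Node("G2")]
write(g1 + g2, g1[0], "new")

cp_answer = read(g2[0], "G1", "CP")
ap_answer = read(g2[0], "G1", "AP")
local_answer = read(g1[1], "G1", "CP")
```

The node in G2 either stays silent or returns the stale value "old"; it has no third option,
which is exactly the trade-off described above.
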
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Now, let us look at a different model. So, we have been looking at an asynchronous model up till 
now; let us look at a partially synchronous model, which we will also discuss with other papers
like Stellar and so on, where clocks are loosely synchronized, there is a bounded clock skew, and
all network messages are either delivered within t_msg time units or they are lost. So, the network
is reliable in this sense, and the clocks are also reliable.


So, the theorem is that we still cannot guarantee CAP; all three still cannot be guaranteed. The
idea is the same: we divide the network into two disjoint partitions, G1 and G2. A write will
happen in one component and a read in the other, as we have seen. Given the fact that the write
cannot propagate to the other component, let us say from G1 to G2, we will never be able to provide
the up-to-date value, so atomicity will not hold. If atomicity does not hold, then the C does not
hold over here, and CAP cannot be guaranteed.


So, in this case, regardless of whether you are asynchronous or partially synchronous, we still
cannot guarantee all three. This is true even if the clocks are synchronized and the network is
reliable to the extent that, within t_msg time units, it either delivers a message or loses it
forever, so there is no cold storage here. Or rather, it would be more correct to say that we
cannot use the theoretical cold storage trick to prove something in this case.
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Now, let us look at the corollary in terms of the partially synchronous model. Assume that no 
messages are lost. If no messages are lost, is there a way out? Well, it turns out that the answer
is yes: use a centralized scheme, with a single central server that maintains the overall state and
also answers queries, and consider a basic read-write object. So, we have a single central server
and multiple client machines; all of them issue read and write requests to the central server,
which returns the result.


So, unlike the asynchronous case, here we can detect message losses. If I send a message, expect an
acknowledgment, and wait for 2 t_msg + t_proc time units, where t_msg is the bound on the message
delay and t_proc is the time needed to process my original message, I will get to know whether the
message was delivered and the ack was received; otherwise, I can send the message again. So, if all
messages are delivered, I will always get the ack back.


So, let us not consider the case where either the message is lost or the ack is lost. Let us
consider the case where all the messages are delivered, we get the acknowledgment back, and there
is no message loss. If that is the case and availability is guaranteed, a response will come, and
the response that will come will be correct.


So, if there is no message loss, we can definitely expect a response within a finite amount of
time, and if we use the central-server-based scheme, this is going to work. Otherwise, the
best-known stale value could be returned. But the case with no message loss is a different case,
because here message loss can be detected. Given that, if we have availability, responses will
come and the responses will also have the correct value.
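
The timeout rule for detecting a lost message can be sketched as a retransmission loop. This is an
illustrative sketch with assumed numbers: `T_MSG` and `T_PROC` are made-up bounds, and `channel` is
a hypothetical callable standing in for the network, which reports the round-trip time of the ack
or `None` when the message or its ack was lost.

```python
T_MSG = 5    # assumed bound on one-way message delay (time units)
T_PROC = 2   # assumed bound on processing time at the receiver
TIMEOUT = 2 * T_MSG + T_PROC

def send_reliably(channel, max_attempts=5):
    """Retransmit until an ack arrives within the timeout.

    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        rtt = channel()
        if rtt is not None and rtt <= TIMEOUT:
            return attempt
        # No ack within 2*t_msg + t_proc: the message or the ack was lost.
    return None

# A scripted channel: the first attempt is lost, the second is acked after 9 units.
outcomes = iter([None, 9])
attempts = send_reliably(lambda: next(outcomes))
```

In the asynchronous model no such `TIMEOUT` exists, which is why a slow reply and a lost message
cannot be told apart there.
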
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So, conclusions and an extension, so, the CAP theorem limits what can be done in a distributed 
system: it really cannot have C, A and P all together. So, this was extended in 2010 to the PACELC
theorem. This actually says something more. It says that if the network is partitioned (P), a
trade-off exists between availability (A) and consistency (C), and this we saw in our discussion.


If it is not partitioned, so if, let us say, all nodes are reachable without partitions, then else,
so this E is for else, there is a trade-off between latency (L) and consistency (C). You can always
have a system with no consistency that is going to be very fast. For example, if you have a set of
servers, then instead of making all of them maintain an up-to-date replica, you could just query
some server and return a possibly stale value; then the latency will be very low, and the
consistency will also be quite weak.


But if I make the consistency quite strong, in the sense that I make this linearizable, then of
course a lot of messages have to be sent between the servers to ensure that we are genuinely
getting the latest value. So, in that case, the latency will shoot up. So, basically, it depends on
what is required in a distributed system; it could be SC, and in SC you will get a solid
correctness model.


We have not discussed eventual consistency, but this is not a bad place to discuss it. Eventual
consistency means that writes are ultimately visible, in the sense that if I am writing to
something, I do not care about the order of writes and other aspects, but writes are ultimately
visible. So, you can have an eventually consistent system where writes are ultimately propagated,
but the latency could be quite low. So, as you can see, there is a trade-off between L and C,
particularly when we do not have partitions. So, this is the PACELC theorem.




(Refer Slide Time: 48:06) 


References 


* Gilbert, Seth, and Nancy Lynch. "Brewer's conjecture and the feasibility of consistent,
  available, partition-tolerant web services." ACM SIGACT News 33.2 (2002): 51-59.


* Golab, Wojciech. "Proving PACELC." ACM SIGACT News 49.1 (2018): 73-81.


And this was also formally proven, as you can see; so, the first paper, by Gilbert and Lynch, was the
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Welcome to the lecture on Ethereum. So, Ethereum is, well, it is not that new as a blockchain, 
but at least it is newer than Bitcoin, about 5 years newer. So, Ethereum is nowadays being used for
a lot of applications other than cryptocurrencies. So, at least as of 2022, when I am recording
this video, Ethereum has become a default choice for many FinTech transactions.
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So, a brief look at the history Dwork and Naor proposed the idea of proof of work, which was 
way back in 1992. The basic concept behind proof of work was that you cannot just be a passive
participant in an algorithm. So, when you say that a certain transaction is valid, or, let us say,
that a certain account has a certain balance, you need to do some amount of work before you can
actually say that. So, what does this ensure? This basically ensures that if 51% of all the nodes,
or rather 51% of the work that is done, is honest, you can guarantee that the blockchain will still
work even without permissions.


So, it took some time: it took 11 years for Vishnumurthy and his group to use proof of work to
secure a currency, in the sense that from 1992 to 2003, the idea of using this for cryptocurrencies
had not gained a lot of traction. Then Bitcoin came in 2008, from Satoshi Nakamoto. But again, that
is not the name of a real person; it is a pseudonym. And then we have seen a deluge of
cryptocurrencies. As we talk, there are a lot of cryptocurrencies; people actively buy them, sell
them, buy futures contracts, hedge on them, and so on. So, Ethereum came in 2013, and it had a
different design philosophy.


So, rather than being made by one person, it was initially proposed by Vitalik Buterin, but then
many people like Gavin Wood and so on came in. So, now it is a big group that actually manages
Ethereum. It is much more than a digital currency; if you think about it, it has generalized the
idea of a digital currency. So, when we transfer some money from one account to another, say from
account A to account B, effectively we are computing the new balances A = A - x, where x is the
amount of money transferred, and B = B + x.


But of course, it is not as simple as this, because we also check whether the new balance of A is
less than 0, which means that the transaction is not valid; in that case we restore the balances to
what they used to be. So, if you see, even for transferring a small amount, we actually need to run
a piece of code like this.
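
The piece of code being described could look roughly as follows. This is a minimal sketch, not
Ethereum's actual transfer logic; the dictionary of balances and the function name `transfer` are
assumptions made for this illustration.

```python
def transfer(balances, src, dst, x):
    """Move x units from src to dst; undo the change if src would go negative."""
    old_src, old_dst = balances[src], balances[dst]
    balances[src] -= x          # A = A - x
    balances[dst] += x          # B = B + x
    if balances[src] < 0:
        # Invalid transaction: restore the previous balances.
        balances[src], balances[dst] = old_src, old_dst
        return False
    return True

balances = {"A": 30, "B": 10}
ok = transfer(balances, "A", "B", 20)      # valid transfer
bad = transfer(balances, "A", "B", 50)     # would make A negative, so it is undone
```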


So, even a basic money transfer is actually the execution of a piece of code. And this code can be
generalized, in the sense that you can say: look, I will transfer 10 units, which in the case of
Ethereum is 10 ethers, if the balance is less than 50; otherwise, I will transfer 5. All such
pieces of code are known as smart contracts.
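
A rule of this kind can be sketched as a small function. The lecture does not say whose balance is
compared with 50, so the sketch assumes it is the recipient's; the function name and the plain
dictionary of balances are likewise assumptions made for this illustration, not contract code.

```python
def conditional_transfer(balances, src, dst):
    """Send 10 units if dst holds less than 50, otherwise send 5."""
    amount = 10 if balances[dst] < 50 else 5   # assumed: the recipient's balance is tested
    if balances[src] < amount:
        return 0                 # not enough funds: nothing changes
    balances[src] -= amount
    balances[dst] += amount
    return amount

balances = {"A": 100, "B": 45}
first = conditional_transfer(balances, "A", "B")    # B holds 45, so 10 units move
second = conditional_transfer(balances, "A", "B")   # B now holds 55, so 5 units move
```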


So, that is why Ethereum, at least as of today, is much more than a digital currency, because it
allows users to run arbitrary code on a shared global state. What is the shared global state? It is
the state of all the accounts. As of 2018, it is the second largest cryptocurrency in terms of
market capitalization. Of course, I do not have the 2022 numbers, but as of today, it is by far the
most popular after Bitcoin. And it is being used for doing a lot of other things as well, like
creating financial products and managing digital art and collectibles. So, these also can be stored
on the blockchain.


So, the idea over here is that of non-fungible tokens; we will see what they are as we keep
discussing. But before I go further, I would like to say that the entire discussion on consensus,
Bitcoin and blockchains is a prerequisite for this chapter. So, kindly go back in the playlist,
look at all the discussion on consensus, Bitcoin and blockchains, and then kindly come back,
because otherwise you will not be able to follow the material.


All the machines (nodes) share a common world state. It is a shared state that is modified by
transactions. Every machine has a private machine state.


So, the idea is that we have a world state, which, as I said, is the state of all the accounts.
Unlike Bitcoin, which did not have the notion of accounts, Ethereum has the notion of accounts, and
all of these accounts together form the world state. Then of course, we have multiple machines.
Each machine has a local state of its own, which is the machine state, but mind you, any
modification that is ever made is made to the world state. Unlike a traditional cloud-based system,
the world state is not present in any dedicated cloud machine; instead, either the entire world
state or parts of it are replicated across all the machines.


So, as we know, in all blockchain-based systems, each machine essentially contains the entire
state. That is probably one reason why blockchains were not popular in the earlier days, prior to
2015: machines did not have the kind of storage needed to store the world state. But now this has
changed. Every machine will also have its private state, which is valid only within the scope of
the currently executing transaction. So, now, let us move on.


So, the basic concepts in Ethereum are like this. Ethereum is a blockchain. The blockchain is like
a linked list; it starts with a genesis block. Bitcoin was the same, and there too we grouped
transactions into blocks. So, the blockchain pretty much contains all the transactions that are
happening in the Ethereum ecosystem. As we have discussed, every machine maintains some state,
which is a slice of the world state along with some internal private machine state. There are two
kinds of functions that each machine can apply on the world state.


So, the first kind is a state transition function of the kind that we have seen, which basically
transfers some money from account A to account B. This changes the world state. Why? Because we are
changing the state of account A and account B, since we are changing their balances. The other is
an account creation transaction. In this case, what is happening is that we are creating a new
account, and every new account also has some code associated with it, in the sense that every state
transition is simulated as a message call.


So, in this case, if, let us say, account A wants to add some money to account B, it will send a
message to account B, and account B will have some code associated with it, which will process the
message. This code was initially made a part of account B by the account creation transaction. So,
that code will see that new money is coming in and decide what needs to be done in this case;
basically, let us say, B.bal += x, where x is the amount. So, Ethereum has a built-in currency, the
same way we had a currency in Bitcoin.
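
The two kinds of transactions can be mimicked with a toy model in which account creation attaches
code to an account and a transfer is a message handled by that code. Everything here is a
hypothetical simplification: the `Account` class, the field name `bal`, and the functions
`create_account` and `message_call` are invented for this sketch and do not reflect the real
Ethereum data layout.

```python
class Account:
    """Hypothetical account: a balance plus code that handles incoming messages."""
    def __init__(self, balance, on_message):
        self.bal = balance
        self.on_message = on_message   # code attached at creation time

def default_code(account, amount):
    # Code run by the receiving account when money arrives.
    account.bal += amount

world_state = {}

def create_account(address, balance=0, code=default_code):
    """Account creation transaction: adds an account and its code to the state."""
    world_state[address] = Account(balance, code)

def message_call(sender, receiver, amount):
    """State transition: the sender pays, the receiver's own code processes the message."""
    src, dst = world_state[sender], world_state[receiver]
    if src.bal < amount:
        return False
    src.bal -= amount
    dst.on_message(dst, amount)
    return True

create_account("A", 100)
create_account("B")
sent = message_call("A", "B", 40)
```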


The lowest denomination is a Wei, and many a time the protocol actually relies on the lowest
denomination, which is one Wei. Then of course, as you can see, we have some much larger
denominations: 10^12 Wei is a Szabo, 10^15 Wei is a Finney and 10^18 Wei is an Ether. Incidentally,
the Szabo and the Finney are named after Nick Szabo and Hal Finney, who were pioneers of digital
currency. We will see that the term ether is used quite a bit later on.


So, the world state is like this. The world state, if you think about it, is a simple mapping,
where every account is represented by its address, which is a 160-bit id, and that is mapped to the
state of the account, or the account state. So, you have an account number and an account state.
And this can be extended to be a generic mechanism. It is not the case that every account needs to
hold some amount of money, or ethers in this case; it could contain anything: it could contain
digital art, for instance, it could contain insurance products, it could contain anything that you
want.


So, here, the key point is that the account states have to be maintained by each machine, in the
sense that every machine has to maintain a database. So, this is quite similar to Bitcoin; it is
just that Bitcoin did not have the notion of accounts, whereas Ethereum does. Furthermore, the
protocol suggests a data structure, the Merkle Patricia tree, which we will look at, to maintain
the state database, which is basically nothing but a large database of key-value pairs.


And so, we will have similar data structures: we had a Merkle tree in Bitcoin, if you think about
it, and we will have something similar over here. As I said, the contents of the root node are
dependent on the state of the entire tree. We will use the same concept for looking at the account
states of Ethereum. But essentially, the way that we perceive the world state of Ethereum is as a
set of key-value pairs.
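
Seen as key-value pairs, the world state is just a mapping from addresses to account states, with
balances counted in Wei. The sketch below is an illustration under assumptions: the two addresses
are made-up 160-bit values, and an account state is reduced to a single `balance` field.

```python
WEI = 1
SZABO = 10**12 * WEI
FINNEY = 10**15 * WEI
ETHER = 10**18 * WEI

# World state: 160-bit address -> account state (here only a balance in Wei).
# The addresses below are made-up values for illustration.
world_state = {
    0x1111111111111111111111111111111111111111: {"balance": 2 * ETHER},
    0x2222222222222222222222222222222222222222: {"balance": 500 * FINNEY},
}

def total_in_ether(state):
    """Sum all balances and express the result in Ether."""
    return sum(account["balance"] for account in state.values()) / ETHER

for address in world_state:
    assert address.bit_length() <= 160   # every address fits in 160 bits

total = total_in_ether(world_state)
```

Keeping balances as integers in Wei avoids fractional arithmetic; the division by `ETHER` is only
for display.
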
So, what is a Merkle Patricia tree, also called a Merkle Patricia trie? So, let us understand what is a
trie? So, a trie, or a Patricia trie, is a data structure which is used to store a bunch of
strings. So, let us say the strings that I have are 'as', 'to', 'te' and 'do'. The way that I would
store them is that I start from the root node and create a prefix tree. We look at the first letter
of the string and match it with the first letter stored in the tree; if a and a match, we come to
that child, and then again s and s match, so we come to the next node. Similarly, we look at 'to'
and 'te'.


So, basically, the idea is that the trie is a representation of the prefix tree, where if you want
to find whether a given string is there or not, all that we do is start with the first letter and
traverse the tree. Optionally, this can be used to store key-value pairs. How do we do that? Well,
we use the same idea: we treat the keys as strings, all the strings are stored in the prefix tree
like this, and the values are at the leaves. So, the values are where the full string for a key
terminates. So, as you can see, a Merkle Patricia trie can be used to store key-value pairs in this
fashion.
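
A prefix tree that stores key-value pairs can be sketched in a few lines. This is a plain trie with
one character per level, without the compression or hashing that the Merkle Patricia variant adds;
the class names and the integer values stored against the example strings are assumptions made for
this illustration.

```python
class TrieNode:
    def __init__(self):
        self.children = {}     # one child per next character
        self.value = None      # set only where a full key terminates
        self.is_key = False

class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key, value):
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.is_key, node.value = True, value

    def lookup(self, key):
        node = self.root
        for ch in key:
            if ch not in node.children:
                return None
            node = node.children[ch]
        return node.value if node.is_key else None

trie = Trie()
for i, word in enumerate(["as", "to", "te", "do"]):
    trie.insert(word, i)
```

The keys 'to' and 'te' share the node for 't', so the root has only three children, and a lookup
costs one step per character of the key.
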
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What is a Merkle tree? So, we have seen this in the case of Bitcoin as well. So, it is basically 
used to store a hash of a bunch of data; not one data item, but a large amount of data, of which we
want to store a hash, such that if there is a small change and there are n data items, it will take
O(log n) time to propagate the update.


So, in this case, what we do is that we take a data item with value A and hash it; we call the
result hash(A). Similarly, we create a tree like this where the individual data items are B and C,
and we create a hash value hash(B, C). So, what happens is that every node contains the hash of its
children.


So, this node contains hash(A), the node hash(B, C) contains the hash of its children, and the
root, furthermore, will contain the hash of the concatenated value of hash(A) and hash(B, C). So,
in this sense, the contents of the root represent the state of the entire tree. Note that this is
not a trie, this is a tree; the earlier one was a trie. So, the state of the entire tree is
represented over here in the contents of the root. If anything changes, then of course, it will be
very easy to propagate the new change to the root in roughly O(log n) time.
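
The way a change in one data item shows up at the root can be checked with a short sketch. It makes
several assumptions not stated in the lecture: SHA-256 as the hash function, a balanced binary
tree, a non-empty list of items, and duplication of the last hash when a level has an odd number of
nodes.

```python
import hashlib

def h(data: bytes) -> bytes:
    """Hash function used at every node (SHA-256 is an assumption here)."""
    return hashlib.sha256(data).digest()

def merkle_root(items):
    """Hash the leaves, then repeatedly hash concatenated pairs up to the root."""
    level = [h(item) for item in items]
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])      # assumed convention for odd-sized levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

root_before = merkle_root([b"A", b"B", b"C", b"D"])
root_after = merkle_root([b"A", b"B", b"C", b"D'"])   # one item changed
```

This function recomputes the whole tree for simplicity; an incremental version would rehash only
the nodes on the path from the changed leaf to the root, which is where the logarithmic update cost
comes from.
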
So, the Merkle Patricia tree is like a union of both the Merkle tree and a Patricia trie. The
leaf nodes are the key-value pairs, and they can be used to store anything. For instance, the key
can be the account id or the transaction id, or even hashes of them. But the key point is that
whatever we consider the key to be, the MPT does not care. The key is used to traverse the MPT in
exactly the same way you would traverse a Patricia trie. So, that is what the key is used for. And
we traverse it a nibble at a time. So, let us say it is an 80-bit key, for instance; then we divide
it into 80/4 = 20 nibbles. What is a nibble? A nibble is 4 bits.
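
Splitting a key into nibbles is a one-liner per byte. The sketch below uses an arbitrary 80-bit key
chosen for illustration; the function name `to_nibbles` is an assumption.

```python
def to_nibbles(key: bytes):
    """Split each byte into its high and low 4-bit halves."""
    nibbles = []
    for byte in key:
        nibbles.append(byte >> 4)      # high nibble
        nibbles.append(byte & 0x0F)    # low nibble
    return nibbles

key = bytes.fromhex("a567" + "00" * 8)   # an arbitrary 80-bit (10-byte) key
path = to_nibbles(key)
```

Each nibble is a value from 0 to 15, which is why a node that branches on one nibble has up to 16
children.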


So, we divide the key into nibbles, and we will have 20 such nibbles. This will basically be a
20-level tree, and we will traverse it a nibble at a time. Because we are dealing in terms of
nibbles, every branch node in the tree will have 16 possible children. But there is a catch.
Who wants a 20-level tree? That is too much. Many a time, a large part of the value space
(the space of possible key values) will be sparse. For this reason, the MPT construction in
Ethereum discourages nodes with a single child.
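
The nibble-at-a-time view of a key can be sketched as follows. This is a minimal illustration, assuming the key is given as raw bytes; the function name `to_nibbles` and the sample key are hypothetical.

```python
def to_nibbles(key: bytes) -> list[int]:
    """Split a key into 4-bit nibbles, high nibble of each byte first."""
    nibbles = []
    for byte in key:
        nibbles.append(byte >> 4)       # high 4 bits
        nibbles.append(byte & 0x0F)     # low 4 bits
    return nibbles


key = bytes.fromhex("a5" * 10)          # an 80-bit (10-byte) key, chosen arbitrarily
path = to_nibbles(key)
print(len(path))                        # 20 nibbles, one per level of the uncompressed trie
print(path[:2])                         # [10, 5], i.e. the hex digits A and 5
```

Each nibble takes one of 16 values, which is why a branch node has 16 possible children.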


So, it discourages any kind of internal structure like this, a chain of nodes that each have a
single child. It is a much better idea that, if this is the internal structure, this node
points directly to a fused node. The fused node indicates the additional nibble values along
the chain. For example, consider hexadecimal values: let us say this is A, this is 5, and
these are 6 and 7. We will basically create one fused node, where we store the fact that we
enter it using A (hexadecimal), and all the children at the next position have 5. After that,
we can have branches.


So, this fused node is called an extension node, which collapses a chain of nodes into a single
node. What we will have is an extension node in which we store 5, and then again we will have
two children below it, 6 and 7. The value space is expected to be reasonably sparse.


So, that is why we will have many of these extension nodes in the MPT, which will reduce the
storage as well as the traversal time substantially. This was the trie aspect, the PT part of
the MPT. In addition, we keep the same idea as in the Merkle tree: an internal node stores the
hash of its children, and so on.
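
The effect of extension nodes can be sketched with a toy trie. This is a minimal sketch under stated assumptions: keys are short lists of nibbles, nodes are plain dicts and tuples, and the names `build_trie` and `compress` are hypothetical. Real MPT nodes also carry values and hashes and encode leaf paths differently, all of which is omitted here.

```python
def build_trie(keys):
    """Uncompressed trie: each node is a dict mapping a nibble to a child node."""
    root = {}
    for key in keys:
        node = root
        for nibble in key:
            node = node.setdefault(nibble, {})
    return root


def compress(node):
    """Collapse every chain of single-child nodes into one extension node."""
    if not node:
        return ("leaf",)
    if len(node) == 1:
        path = []
        while len(node) == 1:                     # follow the chain and record its nibbles
            (nibble, child), = node.items()
            path.append(nibble)
            node = child
        return ("ext", tuple(path), compress(node))
    return ("branch", {nibble: compress(child) for nibble, child in node.items()})


# Keys A56 and A57 share the prefix A5; key B12 forces a branch at the root.
keys = [[0xA, 5, 6], [0xA, 5, 7], [0xB, 1, 2]]
compressed = compress(build_trie(keys))
print(compressed)
```

In the result, the child reached through A is an extension node that stores 5 and points to a branch with children 6 and 7, which mirrors the example in the lecture.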


So, we create what is called a root hash, which is basically the hash stored at the root. The
way this is done is that we treat the Patricia trie as a Merkle tree: we traverse it using DFS,
and every parent computes the hash of its children. Finally, we have the root hash. The
advantage of doing this is that if there is a change at any point, we can update the root hash
incrementally; we do not have to traverse the full tree. We just compute the new hash for the
affected internal node and propagate it up.


So, this is why we have a Merkle Patricia tree: it is easy to add a node, and it is easy to
search for a node, so it acts like a search tree as well. It is compressed because of the
extension nodes. Furthermore, we maintain internal hashes as well as the root hash. This
completes the definition of the Merkle Patricia tree, or the MPT. The MPT is quite important in
Ethereum because it is used to store many, many things.

So, what are these things? First, there is the world state, which is nothing but a mapping from
account IDs (160-bit account IDs) to the account state. Of course, we store several things as a
part of the account state. The word nonce will appear many times in Ethereum, and
unfortunately, every time it will mean a different thing.


In the account state, the nonce is either the number of transactions that have been sent from
this account, or the number of times this account has created a new contract; a contract is
basically a piece of code. The balance is the number of Wei owned by the account. Then we have
the storage root.


So, the storage root is where we actually store the state of the account. This is again stored
as an MPT, and the root hash of that MPT is known as the storage root. So, this stores the
contents of the account. The code hash is computed over all the code that is associated with
the account.


This code is also known as a smart contract, but we will refer to it as just code in this
presentation. The code targets the Ethereum virtual machine, which is a virtual machine meant
to run Ethereum code regardless of the platform. The hash of that code is known as the code
hash.
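
The four fields of the account state and the world-state mapping can be sketched as a small data structure. This is only an illustration: the class name `AccountState`, the helper `h`, and the sample values are hypothetical, SHA-256 stands in for Keccak-256, and a plain dictionary stands in for the MPT.

```python
from dataclasses import dataclass
import hashlib


def h(data: bytes) -> bytes:
    """Stand-in hash (SHA-256); Ethereum itself uses Keccak-256."""
    return hashlib.sha256(data).digest()


@dataclass(frozen=True)
class AccountState:
    nonce: int            # transactions sent (or contracts created) by this account
    balance: int          # amount owned by the account, in Wei
    storage_root: bytes   # root hash of the account's storage MPT
    code_hash: bytes      # hash of the code associated with the account


# World state: 160-bit (20-byte) account ID -> account state.
world_state: dict[bytes, AccountState] = {}

account_id = bytes(20)                 # hypothetical all-zero account ID
code = b"\x60\x00"                     # placeholder bytes standing in for contract code
world_state[account_id] = AccountState(
    nonce=0,
    balance=10**18,                    # 1 ether expressed in Wei
    storage_root=h(b""),               # placeholder for an empty storage trie
    code_hash=h(code),
)
print(world_state[account_id].balance)
```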

So, how do transactions work? A message is transferred from one account to another, and the
other account processes it. What is there in the message? We can have structured data and
unstructured data. The unstructured data goes in the data field, where you can send anything
that you want, function arguments for example, as a part of the message transfer. As we have
discussed earlier, there are two kinds of transactions in Ethereum.


There is a message transfer, and there is the creation of a new account, that is, contract
creation. Every account will have some code associated with it. What does this code do? The
code is invoked when a message is received. The message could be anything: credit yourself,
debit yourself, and so on. The code that runs in that case is known as the body, and this code
has to be part of the account creation or account initialization process. There are a few more
fields. We have seen the nonce, and the nonce here means the same thing.


But as I have said, we will see the term nonce appear in several places, and it will mean
different things. In this case, it is the number of transactions sent by the sender. Now, we
have two very interesting things in Ethereum, so let me explain the motivation behind them. The
key point is that everything in Ethereum is modeled as code execution, including transferring
money. As I have just said, transferring money is not that simple, because you have to run a
small program at the sender side and a small program at the receiver side.

So, the issue is that we now need to provide an Ethereum virtual machine to run the code, as
well as a language for the code. When you are running code, it may go into an infinite loop,
there may be faults in the code, and all kinds of issues can happen. Even though the language
that runs on the Ethereum virtual machine is Turing complete, we still want to charge a premium
for the amount of code that executes. In a sense, we do not want people to write very heavy
code, because that will slow down the entire Ethereum system. So, we define the notion of gas.


Gas stands for gasoline, which is what petrol is called in the United States. Basically,
computational effort is measured in units of gas, and the gas price is the number of Wei we
need to spend per unit of gas. As code keeps on executing, gas continuously gets deducted,
depending upon the instructions you are executing and how much memory and storage you are
using.


So, the idea here is to write code that will not use a lot of gas. Then there is a gas limit.
The idea of the gas limit is that if you have said that you will use 20 units of gas, you
cannot cross that. Crossing it basically means that you may be moving towards an infinite
loop, and we do not want the system to hang.


So, the moment you run out of gas, the transaction stops. Furthermore, the node that is
initiating the transaction needs to actually pay. If you want to initiate a transaction and
you want Ethereum to accept it, you have to pay the gas price multiplied by the number of gas
units that are actually spent.


But the payment is made upfront, and at that moment we do not know how much gas will be used.
So, whatever maximum gas amount you specify is multiplied by the gas price and paid upfront;
later on, a refund is issued if less gas is used. Then we have the standard fields: "to" is
the recipient of the message.


And because Ethereum still has its cryptocurrency DNA, it does have value as a common field,
which you use to transfer value. But if the Ethereum blockchain is being used to implement
something that is not a cryptocurrency, then we need to use the data field, where arbitrary
data can be specified. So, the key idea, again, is that we have accounts, and we have messages
that are sent between accounts. In the case of currency, these are credit-debit kinds of
transactions. But as I said, Ethereum's scope goes well beyond that.




(Refer Slide Time: 25:49) 


The Block 


* Contains a list of transactions + other information 




So, let us now come to the block. As we have seen in Bitcoin, a block basically contains a
list of transactions, along with other information. Here is the fun part: unlike Bitcoin,
which always stores a linked list, Ethereum is going to store a DAG. One path in it is the
longest chain; we will discuss what the longest chain is in some detail later. But it is very
well possible that at some point of time, other nodes tried to add their blocks but were not
successful, in the sense that their blocks were not part of the longest chain.


Bitcoin used to just discard them. But in this case, we do not discard them. In case you lose,
you still get a consolation prize. Ethereum maintains a truncated DAG, where one path is the
longest chain; we will define what the longest chain means at a later point. But again, there
will be a few blocks which were not a part of the longest chain, but did get added at some
point.


So, we will remember them, and we will compensate them for the effort that was put in. Such
blocks are known as ommers. Ommer is basically a gender-neutral term for uncle or aunt, so we
will refer to an ommer as an uncle. If this is the block, then an ommer is a sibling of its
parent. It is not on the chain that is validated; it is not on the blockchain. The fact is
that an unsuccessful attempt was made to add the ommer to the grandparent, but since the
parent won, the ommer got thrown out.

But the good thing here is that once this block is successfully mined and added to the chain,
we will be nice to the ommers and give them a small amount of currency, like a consolation
prize. This has advantages in terms of preventing starvation and so on.

So, what is the list of fields in a block? Of course, we will have a bunch of transactions,
but we have a lot of other things as well. We have a hash of the parent block's header. This
is what makes the blockchain a blockchain, because every node in the blockchain contains the
hash of its previous node. If this is the node, this is the parent.


The moment you store a hash of the parent, this becomes a blockchain. We also maintain a list
of ommers, so we store the hash of the list of ommers. Then there is the beneficiary, that is,
who gets the fees if the block is successfully mined. Then we will have a bunch of MPTs; we
will actually have three.


The root hashes of these MPTs are known as the state root, the transactions root, and the
receipts root. The state root is simply the root hash of the world state, that is, of the
account states. If I take the world state and then take its root hash, I am linking a block
with the root hash of the world state to indicate that, once all the transactions for this
block are processed, this is the world state. Given that the world state itself is stored as
an MPT, it is very easy to get a handle to the world state, which is the root hash of the
trie. Furthermore, a block contains a bunch of transactions, which are also stored as a trie.


There, again, the key is the transaction ID and the value is the content of the transaction.
Mind you, this was there in Bitcoin as well, in a slightly different form, but the idea here
is the same. The idea is that the machines store the state, in the sense that they store the
exact list of transactions, the world state, and so on.


But what is actually there in a block is basically the root hashes. Once a transaction
executes, that is, once the code finishes executing, it will generate some return value and
some logs. Together, these are known as the receipt. It is again stored in a trie, where the
key is the transaction ID and the value is the transaction receipt. The root hash of that trie
is also stored.


Furthermore, if you think about it, a lot of code runs in an Ethereum account. It is possible
that there may be bugs in the code; maybe it did not run correctly, maybe it had issues,
exceptional conditions, and so on. So, we would like to store logs. Logs are defined by the
code, but essentially what you store is a log topic and some sort of a checkpoint of the
machine state, such that you can go back and debug. This is stored in a data structure called
a Bloom filter. We will discuss what a Bloom filter is in some detail in a subsequent slide.


But the key idea is that some debugging and checkpoint information is stored in the block,
because the block is after all associated with executing code. It is important for us to
understand, if something went wrong, where it went wrong, and to store enough debugging
information, including checkpoints and watchpoints. So, this information is stored.

We have a few more fields, and all of these fields, trust me, are very important in a block.
Difficulty is related to the proof-of-work concept in Ethereum. I will not discuss a lot about
this now, because we will get some time later on to discuss what difficulty is. Let us say
that it indicates how difficult it is to mine the block, where mining the block basically
means adding it successfully to the chain.


We will see that this is a varying quantity, and it depends on the previous block, the
timestamp, and so on. But let us discuss this later. The number is important: it provides the
depth of the block starting from the Genesis block. So, if the Genesis block is numbered one,
the next one is two, then three, and so on.


That is how you define the chain. It is a blockchain, so every block has to have a number
starting from the beginning. The gas limit, as we have discussed, is the total amount of gas
that can be used, and gas used is the amount of gas that we have used already. So, gas used
has to be less than or equal to the gas limit. The timestamp is the time at which the block
was created; it is a Unix time value at the block's inception. In addition, we could have
additional metadata because, as I have argued, Ethereum can be used for all kinds of
applications, and many a time the application would need some additional data of its own. So,
that feature has been provided.


Then we have two more parameters, the mix hash and the nonce. In this case, the nonce is not
the number of transactions; it is something totally different, and unfortunately we are still
using the same term. Both of these are used in block mining, in computing the proof of work,
so we will discuss them later. But kindly remember the mix hash and the nonce, to be discussed
later. The mix hash, by the way, is a 32-byte quantity, and the nonce is an 8-byte, or 64-bit,
quantity.

So, let us now come to some things that we mentioned but decided to talk about later:
transaction receipts and the Bloom filter. As we have seen, every transaction modifies the
world state and may optionally produce an output. For example, you may ask Ethereum to compute
the result of some function, such as finding out how many accounts have a balance greater than
100 ether. So, that can come out here. The outputs of the transactions, the receipts, are
again stored in a trie, and the root hash of that trie is stored in the block. So, let us look
at what a transaction receipt contains.


It will have the set of logs that were created for debugging; the post-transaction state, in
the sense of what the world state was after the transaction; the gas that was used; and a
dedicated data structure called a Bloom filter to index the debugging logs. In this case, the
hash of the world state will act as the key in the trie to find the transaction receipt. You
say, look, if this is the world state after a transaction, give the receipt to me. And the
receipt will include all of these things.

Let us now look at a log entry. A log entry is kind of interesting. The key part is that a log
entry is what is produced by machines while they execute transactions. Later on, an Ethereum
client can query the log entries and find out whether certain logs were generated or, say, for
a certain log topic (the log topic could be that you are printing a message), what was
actually printed.


So, this could be done. The log entry consists of three fields: who is the logger, what is the
list of log topics, and what is the log data. If we want to search, we need some sort of a
data structure to give us a yes/no answer with regard to whether a log topic is there or not.


For this yes/no answer, we use an indexing data structure known as the Bloom filter. What we
do is hash this entire thing using a hashing algorithm to obtain an index into a 2048-entry
array, where each entry is 1 bit.


Let us say the value of the index is 1117. We will go to entry 1117 and set it to 1. That
indicates that these exact contents hash to this entry. What is the probability that another
set of logs hashes to the same entry? Well, that is 1/2048, which is fine. If the number of
transactions in a block is limited, say 32 to 64, then this is a risk that we can take. So,
this will tell us whether a given log is present in this block or not. The Bloom filter, even
otherwise, is a very standard data structure, which is used to find out whether a certain
piece of information is there or not.
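
The yes/no lookup described above can be sketched as follows. This is a simplified illustration that follows the lecture's description of one array entry per hashed item; the class name `Bloom` and the helper `bloom_index` are hypothetical, and SHA-256 stands in for Keccak-256. Ethereum's actual filter sets three bits per item, so treat this only as a sketch of the principle.

```python
import hashlib

BLOOM_BITS = 2048                      # size of the bit array, as in the lecture


def bloom_index(item: bytes) -> int:
    """Map an item to one of the 2048 entries (SHA-256 is a stand-in hash)."""
    digest = hashlib.sha256(item).digest()
    return int.from_bytes(digest[:2], "big") % BLOOM_BITS


class Bloom:
    def __init__(self) -> None:
        self.bits = 0                  # the 2048 one-bit entries, packed into an integer

    def add(self, item: bytes) -> None:
        self.bits |= 1 << bloom_index(item)

    def might_contain(self, item: bytes) -> bool:
        """False means definitely absent; True means possibly present."""
        return bool((self.bits >> bloom_index(item)) & 1)


bloom = Bloom()
bloom.add(b"topic:Transfer")           # hypothetical log topic
print(bloom.might_contain(b"topic:Transfer"))
```

A "no" answer is always correct, while a "yes" answer may be a false positive when two different items map to the same entry, which is the 1/2048 risk mentioned above.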

Now, let us come to gas and payment. We have seen the gas price and the gas limit. Why was
this done? The reason was to ensure that nobody monopolizes the CPUs in Ethereum and that we
do not get into infinite loops, because every transaction in Ethereum has to be validated,
which means you have to run the code. That is why the entire amount, which is the gas price ×
the gas limit (the maximum amount of gas that can be used for this transaction) + the value
(the amount that is going to be transferred out of the account), is deducted upfront at the
beginning of the transaction, such that we always have enough money and gas with us. Let us
say we did not spend the full gas limit, but only half of it; the unused amount is later
refunded. And given the fact that we are in a blockchain kind of setting, the implicit trust
is there.
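
The upfront deduction and the later refund amount to simple arithmetic, sketched below. The function names and the numbers are made up for illustration, and all amounts are assumed to be in Wei.

```python
def upfront_cost(gas_price: int, gas_limit: int, value: int) -> int:
    """Amount deducted from the sender before the transaction runs."""
    return gas_price * gas_limit + value


def refund(gas_price: int, gas_limit: int, gas_used: int) -> int:
    """Amount returned to the sender for the gas that was not used."""
    if not 0 <= gas_used <= gas_limit:
        raise ValueError("gas used must lie between 0 and the gas limit")
    return gas_price * (gas_limit - gas_used)


gas_price, gas_limit, value = 10, 20, 1_000          # hypothetical numbers
print(upfront_cost(gas_price, gas_limit, value))     # 10 * 20 + 1000 = 1200
print(refund(gas_price, gas_limit, gas_used=10))     # half the limit unused: 10 * 10 = 100
```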


We will discuss trustworthiness issues later. But the main advantage of using Ethereum is that
as long as a majority of the nodes are honest, you are assured that you will get back the
refund. Clearly, if the transaction initiator increases the gas price, it will provide a much
higher incentive to mine. Why?


Because the incentive that miners get is tied to the gas price. The higher the gas price, the
more miners will want to mine the transaction and make it a part of a block. This basically
means that if you want your transaction to get in, you had better pay for it. And how do you
pay for it? By increasing the gas price, which makes it likely that it gets added to the
blockchain sooner.


(Refer Slide Time: 39:15) 




Transaction Execution 
* Why have the notion of gas?
This places a limit on the computation that miners need to do. It ensures
that they don’t enter an infinite loop or become unresponsive. If a transaction
runs out of gas, it stops, and the state is reverted back, refunds are issued.

Now, let us come to transaction execution. As we have seen, every transaction maintains a
substate. Along with its execution state, it maintains a few other things. One is a
self-destruct set, which is a set of accounts that will be destroyed after the transaction
completes.


Many a time, a transaction may say: look, let me finish, and after I am done, kindly destroy
this account, or go and destroy some other accounts. For example, one such transaction could
be: go to all the accounts of students in a school and transfer all their money back to the
school, because that could be their safety deposit and they could be the graduating batch.


Then we destroy their accounts, so those accounts will enter the self-destruct set. We have
seen the set of logs that could be generated and how to index them with a Bloom filter. And of
course, there is the refund balance, which is the amount of virtual money that needs to go
back to the sender of the transaction.


So, why have this notion of gas? As I said, it places a limit on the computation that miners
need to do, because every node in Ethereum, or at least the majority, has to successfully mine
the block. Mining the block here means validating the block, that is, validating all the
transactions in the block and running the code associated with them.


So, the system remains responsive and does not enter an infinite loop. If you run out of gas,
the state needs to be reverted back, because a transaction is all or nothing. Then, refunds
are issued. And because we are in an Ethereum-like setup, it is guaranteed that refunds will
always be issued correctly.
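
The all-or-nothing behaviour when gas runs out can be sketched with a toy executor. This is not the EVM: the function `execute`, the operation format, and the account names are hypothetical, and only the state revert is modelled, not the refund accounting.

```python
def execute(state: dict[str, int], ops: list[tuple[int, str, int]], gas_limit: int):
    """Run ops in order; each op is (gas_cost, account, balance_delta).

    Returns (success, gas_used). On running out of gas, every change is undone.
    """
    snapshot = dict(state)                  # copy taken before the transaction starts
    gas_used = 0
    for gas_cost, account, delta in ops:
        if gas_used + gas_cost > gas_limit:
            state.clear()
            state.update(snapshot)          # all or nothing: revert to the snapshot
            return False, gas_used
        gas_used += gas_cost
        state[account] = state.get(account, 0) + delta
    return True, gas_used


transfer = [(5, "alice", -10), (5, "bob", 10)]    # hypothetical two-step transfer

balances = {"alice": 100, "bob": 0}
print(execute(balances, transfer, gas_limit=10), balances)   # enough gas: both steps apply

balances = {"alice": 100, "bob": 0}
print(execute(balances, transfer, gas_limit=7), balances)    # out of gas: state is unchanged
```

In the second run the debit from the first step is rolled back, so the sender never ends up in a half-completed transfer.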

Now, let us come to the basic idea of message calls. How does that work? Along with every
account, there is some amount of code. You could have one function or multiple functions, but
there will be at least one. This code that is associated with the account is known as a smart
contract.


So, why is it a smart contract as opposed to a dumb contract? A dumb contract is one where you
can decide to hold yourself back from honoring the contract. But in this case, something gets
added to the chain only if a majority of nodes agree. Once an account and its associated smart
contract have been added, the moment a message goes to it, the code of the smart contract has
to execute, because again, you are assuming that a majority of the nodes are honest.


So, they will do this on their own, and that will ensure that all contracts are honored. That
is why it is a smart contract. What will the message contain? The usual stuff: the sender, the
originator of the transaction, the recipient, the available gas, the value that needs to be
transferred, the gas price, additional inputs, and the call stack. Why the call stack? Because
it is possible that one account sends a message to another and, in response, a few more
message calls are made. So, we will have a call stack in this fashion, and the call stack of
course has a finite, bounded depth.
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Execution Model 


* The Ethereum Virtual Machine (EVM) is a Turing-complete machine.
However, the number of computations is bounded by the amount of
available gas.


* Word size: 32 bytes


* Maximum stack size: 1024 

So, all the code that we have been talking about will execute on the Ethereum virtual machine,
or EVM, which per se is a Turing-complete machine. Turing completeness is something that is
studied in computer science; it means that the machine is as powerful as a Turing machine. In
informal terms, it can execute anything. But there is an important caveat here: it is a
bounded Turing machine, because we have the notion of gas, so we cannot have an unlimited
amount of computation.


The word size in the Ethereum virtual machine is 32 bytes. That is our minimum granularity in
memory. The maximum stack size is 1024 entries, which is not very large, so it is not meant
for huge computations. For storage, we have an independent array of words, where each word is
32 bytes. The code as such is stored in a read-only memory, or ROM, region. It is not a
classical ROM, but the code is stored in a place where it cannot be dynamically modified. So,
it is not JIT kind of code; the code is static, and it does not change.

So, when is gas charged? We have some fees which are intrinsic to the operation. The code is
executed instruction by instruction, and every instruction has a certain gas cost, and hence a
price in Wei. So, as we keep on executing, we keep on accumulating gas.


Furthermore, if there are some big calls, like message calls or contract creation calls, then
we have a much bigger payment in terms of gas. Along with that, if we ever ask for more memory
than what is initially provided, then additional gas is also charged. The issue with storage
is kind of tricky. If we use more storage, of course we need to pay for it, but if we free
storage, the amount is refunded.

So, here is an overview of the execution. There is nothing spectacular about it. The EVM is a
regular sandboxed execution environment, which is isolated from the host. That ensures safety
and security. Whenever there is an exception of any type, the EVM stops. An exception could
also be that the gas used exceeds the gas limit; then also you stop.


Machine state, as I said, is maintained. The machine state in this case is the amount of
available gas, the current program counter of the code, the contents of the memory, the number
of active words in memory, and so on. We measure everything in terms of 32-byte words. And of
course, there are the contents of the stack. That is the machine state, and once the
transaction is done, the machine state reverts back to null.

In any blockchain, the most important question is how we ensure that all of us agree on the
fact that there is a single chain. This is, in effect, solving the consensus problem with
potentially faulty nodes, which could be faulty in the Byzantine sense since they could be
malicious. How do we ensure that?


Unlike Bitcoin, as we have seen, Ethereum maintains a tree, kind of a DAG of blocks. We also
maintain a list of ommers, that is, uncles, great uncles, and so on. Still, there is no
confusion about the real blockchain, which is a path in the tree. The reason is that it is the
path with the maximum cumulative difficulty, and we are guaranteed that there will be only one
such path.
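
Choosing the path with the maximum cumulative difficulty can be sketched on a toy block tree. The block names, the difficulty values, and the functions `total_difficulty` and `canonical_chain` are hypothetical; the sketch assumes every block records its parent and that the maximum is unique, as stated above.

```python
from typing import Optional

# Each block maps its hash to (parent hash, difficulty); the genesis block has no parent.
Block = tuple[Optional[str], int]


def total_difficulty(blocks: dict[str, Block], block_hash: str) -> int:
    """Sum of difficulties along the path from the given block back to genesis."""
    total = 0
    current: Optional[str] = block_hash
    while current is not None:
        parent, difficulty = blocks[current]
        total += difficulty
        current = parent
    return total


def canonical_chain(blocks: dict[str, Block]) -> list[str]:
    """Path from genesis to the block with the largest cumulative difficulty."""
    head = max(blocks, key=lambda b: total_difficulty(blocks, b))
    chain = []
    current: Optional[str] = head
    while current is not None:
        chain.append(current)
        current = blocks[current][0]
    return chain[::-1]


blocks = {
    "G": (None, 1),     # genesis
    "A": ("G", 5),      # competing child of G
    "B": ("G", 3),
    "C": ("B", 4),      # G -> B -> C accumulates 1 + 3 + 4 = 8, more than G -> A with 6
}
print(canonical_chain(blocks))   # ['G', 'B', 'C']
```

Here block A is a sibling of C's parent, so from C's point of view it plays the role of an ommer.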


This is clearly the hardest to mine. Given that this path involves the maximum amount of proof
of work, nobody will have any doubts about the fact that it is the hardest path and has
required the maximum computational effort. It is possible to prove that if 50 + epsilon
percent, colloquially called 51%, of the computational power is honest, then there will be no
disagreement on this path. So, what are the rewards? In Bitcoin, we had a reward for miners.
But a problem with Bitcoin was that the maximum amount of bitcoin that can be generated is
fixed. In this case it is not fixed; it can be very high.


The block reward is 5 ether, which, if you see, is actually a lot, given the fact that one
ether is 10^18 Wei. It is credited to the address of the beneficiary. And as I mentioned
earlier, the ommers also get a consolation prize. Two ommers are maintained, and they are the
ones who get the consolation prize, which is 1/32 of the block reward each. So, this is what
they get. Typically, when a block is successfully mined, the ommers also get something.

So, now, let us come to the proof of work. In any blockchain kind of system, the proof-of-work
computation is the key. What was observed in Bitcoin? You need to understand that Ethereum
came much later, so it had enough time to see what was wrong in Bitcoin. One of the design
principles for the proof-of-work function was that it should be accessible to as many people
as possible; you should not use a custom algorithm or something that only somebody with
certain hardware can run.


Furthermore, what people started doing with Bitcoin is that they started using their own
custom ASICs and FPGAs. So, the designers wanted a function that is ASIC hard, in the sense
that there is no great benefit in using an ASIC. The benefit of using a custom chip comes when
you can break the computation down into many, many parallel components. So, the algorithm was
made in such a way that it has a lot of sequential steps, and it is very hard to parallelize
it efficiently. Furthermore, it has a lot of memory accesses.


This also ensured that you would need a large amount of memory. So, larger systems are
unlikely to help, because nobody would have that much memory. The idea was to make it general
purpose and ensure that there is not adequate parallelism to exploit; that makes the algorithm
ASIC hard. The proof-of-work algorithm that came out of all of this is called Ethash.

So, the idea of Ethash is like this. We scan all the block headers; you do not have to scan
from the Genesis block, you can scan from a checkpoint. We hash those values and compute a
seed. The seed is what creates the dependency between the previous blocks on the blockchain
and the block that you want to mine and insert. This still involves an O(1) amount of work.
From the seed, we generate a cache of pseudorandom data, which is typically 16 megabytes. You
can generate this from the seed: run a pseudorandom number generator and generate all this
data.


And from this cache, you generate a data set. This is not a subset of the cache; the data set is derived from the cache, and the data set is large, several gigabytes. This is what makes it hard to run a parallel program, or to run this on an ASIC; you need a system with a lot of memory, and that is how mining was made difficult. The data set also keeps changing, evolving in a certain sense: it is updated, say, every 30,000 to 100,000 blocks. Otherwise, it remains constant for a reasonably large chunk of blocks. So, the data set is of the order of gigabytes.


So, if you go back to the fields in a block, you will recall that there were two fields that we had not said much about; we had just said that we use them in the proof of work. These two were the mix hash and the nonce. So, what we do is that we take the header of the block, we pick a random nonce, and we take a random slice of the data set. This random slice of the data set is what gives us the mix hash, and the data set is this big thing that we generated. So, we take a random slice of the data set, and the mix hash comes out of it.
We generate a random nonce, and we take all three quantities: the header, the nonce, and the random slice. Then we compute multiple rounds, so we do this multiple times. Every time we compute a hash, and the final output is like a hash of hashes; it is a compressed digest, where we have taken the data and mixed it several times. In each iteration we have, of course, used the same random nonce, but we have taken different slices of the data set. So, it is important to understand that Step 5 is essentially running Step 4 several times; the nonce remains the same for the entire process.


But the random slice changes. So, by changing the random slice several times while keeping the nonce the same, you finally arrive at a compressed digest. Those who are familiar with cryptographic hashing algorithms will realize that they also work in a similar way: you take the data and keep on mixing and permuting it for several rounds. So, this is similar in principle, but of course the way it works is quite different. The output is that we finally have a master hash, which I am calling the compressed digest.
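The seed, cache, data set and mixing rounds described above can be sketched in a few lines of Python. This is only a toy illustration and not the real Ethash: the sizes are tiny, SHA3-256 stands in for the actual hash functions, and the helper names (`h`, `make_cache`, `make_dataset`, `toy_pow`) are made up for this example. What it does preserve is the structure: each round picks its slice of the data set from the current mix, so the rounds have to run one after the other, while the nonce stays fixed.

```python
import hashlib

def h(*parts: bytes) -> bytes:
    """SHA3-256 over the concatenated parts (a stand-in hash for this sketch)."""
    return hashlib.sha3_256(b"".join(parts)).digest()

def make_cache(seed: bytes, n_items: int) -> list[bytes]:
    """Expand the seed into a small pseudo-random cache by repeated hashing."""
    cache = [h(seed)]
    for _ in range(n_items - 1):
        cache.append(h(cache[-1]))
    return cache

def make_dataset(cache: list[bytes], n_items: int) -> list[bytes]:
    """Derive every data set item from a couple of cache entries."""
    n = len(cache)
    dataset = []
    for i in range(n_items):
        item = h(cache[i % n], i.to_bytes(4, "big"))
        j = int.from_bytes(item[:4], "big") % n  # second entry depends on the first
        dataset.append(h(item, cache[j]))
    return dataset

def toy_pow(header: bytes, nonce: int, dataset: list[bytes], rounds: int = 64):
    """Return (mix_hash, digest) for one nonce; the nonce is fixed across rounds."""
    nonce_bytes = nonce.to_bytes(8, "big")
    mix = h(header, nonce_bytes)
    for _ in range(rounds):
        # The slice index depends on the current mix, so rounds are sequential.
        idx = int.from_bytes(mix[:4], "big") % len(dataset)
        mix = h(mix, dataset[idx])
    digest = h(header, nonce_bytes, mix)
    return mix, digest

seed = h(b"hash of headers since the checkpoint")  # hypothetical seed input
cache = make_cache(seed, 16)
dataset = make_dataset(cache, 256)
mix_hash, digest = toy_pow(b"block header", 7, dataset)
```

Because every lookup needs the full data set to be in memory, a verifier or miner with little memory would have to regenerate items on the fly, which is the memory-hardness the lecture refers to.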


So, how do you verify that this is indeed the proof of work? The way you verify is that the digest has to be less than a threshold, and the threshold is a linear function of $2^{256}$ / difficulty. We have seen the difficulty before. So, now just look at it: the higher the difficulty, the lower the threshold. If the threshold is lower and the digest is more or less randomly distributed, it will be harder for any random nonce to lead to a digest for which this condition is satisfied.


If the difficulty is very, very high, the threshold will be very, very low. And if the threshold is very low, then finding a nonce whose digest lies between, let us say, zero and the threshold is going to be quite difficult. The function is also so complex that it is quite hard to work backwards, where you fix the value of a digest and find out the value of the nonce. So, pretty much you have to guess random values of the nonce and you have to be lucky. So that is the key idea.
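The threshold test and the guess-and-check search can be sketched as follows. This is a minimal sketch under stated assumptions: a single SHA3-256 call stands in for the full compressed digest, the function names are hypothetical, and nonces are tried in increasing order rather than at random, which does not change the odds because the digest is effectively random for each nonce. With a digest that is uniform over 256 bits, each nonce succeeds with probability of roughly 1 / difficulty, so the expected number of tries is roughly the difficulty.

```python
import hashlib

MAX_TARGET = 2 ** 256

def digest_for(header: bytes, nonce: int) -> int:
    """Stand-in for the compressed digest: one SHA3-256 call, read as an integer."""
    data = header + nonce.to_bytes(8, "big")
    return int.from_bytes(hashlib.sha3_256(data).digest(), "big")

def threshold(difficulty: int) -> int:
    """Higher difficulty gives a lower threshold."""
    return MAX_TARGET // difficulty

def verify(header: bytes, nonce: int, difficulty: int) -> bool:
    """Verification is one digest computation and one comparison."""
    return digest_for(header, nonce) < threshold(difficulty)

def mine(header: bytes, difficulty: int, max_tries: int = 1_000_000):
    """Try nonces until one satisfies the threshold condition."""
    for nonce in range(max_tries):
        if verify(header, nonce, difficulty):
            return nonce
    return None  # no luck within max_tries

nonce = mine(b"toy block header", difficulty=1000)
print(nonce, verify(b"toy block header", nonce, 1000))
```

Note the asymmetry: `mine` may loop many times, whereas `verify` does a fixed amount of work for any claimed nonce.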


So, the point is that the difficulty basically determines how long, on average, it takes to mine a block. Mining a block means what? Finding the mix hash and nonce such that this condition is satisfied. The way the Ethereum designers have coded their system is that the difficulty gradually increases with time.


And the reason is quite simple: over time, computational capability increases. Our processors get faster, we get more memory on chip, and our memory also becomes faster. So, we cannot keep the same difficulty that we had, otherwise it will become very easy to mine blocks. And the easier it is to mine blocks, the harder it is to ensure that we still have the view of a single chain.


So, that is why there is a need to keep the difficulty high, and the difficulty also keeps increasing with time. In fact, the difficulty of a block is a function of the difficulty of its previous block. Verifying, as you can see, is quite easy. Any nonce that leads to condition 6 being satisfied, which is this threshold condition, essentially satisfies the proof of work. There will be many such nonces; as long as we can find one nonce, along with the corresponding mix hash value, such that this holds, we know that we are done.
So, as many of you would have seen, in this case the search is kind of probabilistic. So, it is not giving any party a distinct advantage, unless, of course, that party can try many nonces. You can, in principle, take many values of the nonce and run them in parallel. That, of course, can be done. But the thing is that it is still hard, because even running one instance needs a lot of resources. It is not possible to run more than maybe two or three instances on a single ASIC, because we will run out of area. Of course, if I have a supercomputer, I can do that. So then again, the algorithm is biased towards parties that have a lot of computational power.


So, we are not saying that 51 percent of the parties have to be honest; we are saying that the holders of 51 percent of the computational power have to be honest.

Types of smart contracts. So, we can have a lot of smart contracts. Let me start from the bottom: we could have a cryptocurrency. The nice thing about Ethereum is that it has built-in support for accounts. So, that makes it a default choice not only for currency, but for anything which is tradable.


We can use the internal proof of work mechanism to generate pseudo-random numbers; this can easily be done with the mix hash. We can use it as a data provenance mechanism, in the sense that we can use it to provide news feeds, the current time and so on; everybody would know that a given node has provided something, and you can always verify it later on. So, if it turns out to be wrong, we can discard that node. So, the good thing with the smart contract and blockchain mechanism is that it is the responsibility of the community to enforce that contracts are executed and that the world state is updated honestly, subject to this 51 percent criterion.
So, now, sorry, I missed this slide on scalability. The issue is that the rate of transaction commit, as we have seen in Bitcoin, is quite slow. It takes 10 to 60 minutes to finalize a transaction; in Ethereum it is seconds to minutes. But we can still make it faster. So, what we can do is that we do not have to store all the blocks and transactions; we can store checkpoints, as we have seen.


We can periodically checkpoint the world state trie, in the sense that if I am a new node and I want to verify that the world state is correct, I do not have to start from the genesis block; I can start from the previous checkpoint. Inactive nodes, nodes that are not active, can be thrown out. You can of course also create a hierarchical structure, where we divide the entire network into sub-networks, and each sub-network computes its own limited chain.


So, these are known as shards, and this approach is known as sharding. These lightweight chains are finally joined. So, we can create a hierarchical setup to do this, where small lightweight chains are made into a bigger chain. All of this is possible, and this has rendered Ethereum quite practical, quite scalable and quite popular as of today.
These are the references. The main reference, of course, is the Ethereum project yellow paper, which is always a document in progress. Currently, the proof of work is a computational proof of work. Ethereum is moving to Ethereum 2, which will be based on proof of stake rather than proof of work. So, once that comes, maybe another video is due from my side. The yellow paper is basically for implementers; it is reasonably technical. If you want something lighter, you can go to the beige paper, which is an easier read.
Stellar Consensus Protocol
Welcome to the lecture on the Stellar Consensus Protocol. In this lecture, we will discuss a very different kind of consensus protocol. We have seen a lot of consensus protocols up till now: we have seen Paxos and Raft, and we have seen Bitcoin and Ethereum. Stellar is different: it is more scalable, and it is also new; Stellar is a 2016-18 protocol. So, we will see that it is much newer, much faster, and much more complicated as well.
* Not suitable for open membership (anybody can join)
* Bitcoin and Ethereum
So, as we have discussed, there are two kinds of blockchains. There are permissioned blockchains, which are the easier systems, such as Hyperledger. Then, we have permissionless blockchain systems, such as Bitcoin and Ethereum, where you do not need permission; but, because of probabilistic, or rather cryptographic, guarantees, it is possible to ensure that there is only a single version of the chain that is accepted. In the case of a permissioned system such as Hyperledger, all the participants are known and they have login IDs, so all of them can be authenticated. And if they turn malicious, some action can be taken.


So, a permissioned system is clearly not suitable for open membership, as we have seen; it is not the case that anybody can join at any time. This is more like a corporate intranet kind of system. What makes blockchains really interesting is permissionless systems, of which Bitcoin and Ethereum are prime examples. We can use them to implement cryptocurrencies, which is by far their biggest selling point.
So, the key point for all of them, and for any protocol that is tolerant to Byzantine faults (and that includes Ethereum and Bitcoin as well), is that in the case of Byzantine faults, nodes can behave arbitrarily and maliciously. Furthermore, malicious nodes can cooperate and collaborate with each other, and send messages to confuse the rest. If you look at our classical fault tolerance result, which is there earlier in this lecture series, you will find that if we have 3f + 1 nodes, we can tolerate at most f Byzantine failures.


And even then it takes a lot of time; it takes something like n factorial time to come to an agreement. So, it takes that much time to come to a consensus, but nonetheless, with 3f + 1 nodes, the maximum number of Byzantine failures we can tolerate is f. Not more than that, because beyond that the system is not guaranteed to complete. So, in this case, 2f + 1 nodes make a quorum. What does that mean? All of them need to be correct, and all of them need to agree on the same consensus value, regardless of the rest. So, we can think of them as forming a quorum.


Consider even a classic centralized system. If the centralized server is guaranteed to be honest, then all that you need to do is send a message to the centralized system and get a reply. Insofar as all the nodes are concerned, the centralized system is the quorum. So, what is a quorum in this case? A quorum is a set of nodes that have to accept your write, and that provide you the value of a read. Of course, they may collaborate among each other to provide you the value of a read. But, at least if they are correct, you know that the entire system is correct. In the case of classical Byzantine fault tolerance, 2f + 1 nodes is the minimum that you should have for a quorum size, to ensure that you are getting the correct answer back.
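The arithmetic behind the 3f + 1 and 2f + 1 numbers can be checked with a small sketch. The function name `bft_sizes` is made up for this example. The counting argument it encodes is the standard one: two quorums of size 2f + 1 drawn from 3f + 1 nodes must share at least f + 1 nodes, and since at most f nodes are Byzantine, at least one shared node is correct.

```python
def bft_sizes(f: int) -> dict:
    """Classical Byzantine quorum arithmetic for tolerating f faulty nodes."""
    n = 3 * f + 1                 # total number of nodes
    quorum = 2 * f + 1            # quorum size
    # Two sets of size q inside n nodes overlap in at least 2q - n nodes.
    min_overlap = 2 * quorum - n  # equals f + 1
    # Even if all f faulty nodes sit in the overlap, this many are correct.
    min_correct_in_overlap = min_overlap - f  # equals 1
    return {
        "n": n,
        "quorum": quorum,
        "min_overlap": min_overlap,
        "min_correct_in_overlap": min_correct_in_overlap,
    }

for f in (1, 2, 3):
    print(f, bft_sizes(f))
```

This "quorums overlap in a correct node" property is the one that the federated setting discussed next has to recover without a global node count.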


(Refer Slide Time: 04:24) 


New Ideas: Stellar and Ripple

Federated Byzantine Quorum System (FBQS)

* The participants are not known
* There is no need for all the nodes to participate in a consensus protocol
* Every node chooses whom to trust (think of it as a private quorum)
* Such protocols are very modular and scalable


So, some new ideas have been proposed, such as Stellar and Ripple. They use a very interesting new concept known as a federated Byzantine quorum system. Byzantine we know, quorum we know; what is federated? Well, federated basically means like in a federation of states. So, in a sense, it is a hierarchical setup in which different groups elect their value, and then ultimately one of them is chosen. You can think of the entire network as being broken into different subgroups, where each subgroup, in a certain sense, elects its value.


But, what are the key advantages? Why have Stellar and not the others? Why are we discussing it? The biggest advantage of Stellar is that the participants are not known. This is unlike many of the other protocols, including Bitcoin and Ethereum, where at least the participants are known, or the network is known, or some multicast ID is known with which you can send a message to all the participants. In this case, the participants are genuinely not known; they can join and leave at will. There is no need for all the nodes to participate in the consensus protocol, and this is key.


Because what was happening in Bitcoin, for instance, is that all the nodes had to participate; in fact, 51 percent had to agree, so we needed a lot of participation. Only then was a block getting ratified. In this case, that is not the requirement at all. This makes a heaven and earth difference. Furthermore, every node chooses whom to trust.


So, think of it as a private quorum, in the sense that every node only knows a few other nodes, and it chooses to trust only those nodes and not the rest. Of course, there are some restrictions; this protocol is not as general purpose as Ethereum, for instance. But again, there are advantages, in the sense that this relies on local knowledge and not on global knowledge. You do not have to know all the nodes; you know a few and you trust them. So, this is modular, scalable and ultra-fast.
So, let us first discuss some background and assumptions. Let us say the universe of nodes is V, the set of all the nodes. A node is correct if it executes as per its specifications; this is the dictionary meaning of correct. If a node is either correct, or it just fails without sending any malicious messages, then the node is dubbed honest. Such a failure is fine; it is a fail-stop failure, not a Byzantine failure. So, where are we? We are at the point where we say that a node is correct, which means it executes as per specifications.


Or, we say that it can have a fail-stop failure, but will not send any malicious message; then the node is dubbed honest. The rest of the nodes, we say, are faulty. The Stellar protocol can work in an asynchronous setting; it is just that we cannot make certain guarantees. But if it is a partially synchronous network, which is often the case, where the clocks are loosely synchronized, we can make more guarantees, as we will see. So, we assume periodic clock synchronization; this is known as a global stabilization event, and the time at which the clocks are synchronized is the global stabilization time.


All the messages have a bounded amount of delay, and the clock skew between any two clocks is bounded; that is important. So, this is a partially synchronous network, and most practical systems are partially synchronous. There is nothing surprising about this; it is not a wild assumption at all.
A quorum is a set of nodes. Every node that is a part of a
quorum also has at least one quorum slice in it (in the quorum).


Assume that for the time being all the faulty nodes do not lie about
their quorum slices.


So, what is the key idea? The key idea is that every node trusts a set of nodes. If this is one node, you create a quorum slice around it. So, this could be a quorum slice, and this could be a quorum slice; of course, the node has to be a part of each of its quorum slices. In this case, there are three quorum slices, and the node is a part of all three, because every node has to trust itself. A node can have multiple quorum slices, as you can see over here. Informally, a quorum slice is a set of nodes that the node trusts, and a node is a member of all of its quorum slices.


So, this is very easy to do. If you have a large network, you can say that a node will trust its system administrator, and if there are multiple system administrators, it could trust all of them; that could be one quorum slice. Or a node could trust the machines of some colleagues, or a node could trust the machines of some students.


So then, as you can see, we have three quorum slices over here; we are not restricted to one. A quorum is defined as a set of nodes; this is a very important concept. Every node that is a part of a quorum will also have at least one of its quorum slices in the quorum. So, let us say v is a part of a quorum.


It has three quorum slices, QS1, QS2 and QS3; at least one of them has to be a part of the quorum. So, what is the idea? The idea is that a quorum is a set of nodes such that, if you take any node in the quorum, at least one of its quorum slices is fully contained within the quorum.


So, let us say this is a node. The node could have multiple quorum slices; we are not concerned about that. But there is at least one quorum slice which is fully contained within the quorum, and this is the case for all the nodes; so this is a quorum. Intuitively, what does this mean? A quorum slice is basically the set of nodes that a given node trusts.


Now, the moment we have a quorum, consider every node in the quorum; consider this node, for instance. If one of its quorum slices is within the quorum, then what can you say? You can say that if the entire quorum takes a decision, that is, if all the nodes within the quorum take a decision, then at least one of its quorum slices has also taken the decision. And given the fact that the quorum slice is trusted, the node can use that fact to make a decision.


So, this node can use this fact, and some other node can also use this fact, because it too will have at least one quorum slice which is fully contained within the quorum. So, this basically means that if the entire quorum takes a decision, whatever be the decision (in the case of consensus, for instance), then all the nodes within the quorum can have some degree of relief. The relief is that a set of trusted nodes, which is at least one of their quorum slices, has been a part of it. So, this gives a degree of sanctity to the decision, which we will see to be very important.


So clearly, in this case, the notion of the quorum is the most important concept. Now, I assume for the time being that the faulty nodes are not lying about their quorum slices. Faulty nodes are Byzantine faulty, so they could in principle lie about their quorum slices. Let us assume that this is not happening; but even if it does happen, it is not an issue.
{v1, v2}, {v3}, {v4}, {v1, v2, v3, v4}, ....


So, let us give an example of quorums and quorum slices. Let us say v1 has one quorum slice, {v1, v2}. v2 has two quorum slices, as you can see; the second one also involves v3 and v4. There are some trivial quorums. Let us say {v3} is a quorum in itself, because v3 is the only node in it, and {v3} is its quorum slice, which is fully contained; fair enough. {v1, v2} is a quorum. Why?


Because v1's quorum slice {v1, v2} is fully contained here. v2 has two quorum slices; one is not fully contained, but {v1, v2} is fully contained. And then of course, the universe of all the nodes is also a quorum, because all the quorum slices will definitely be in it. So, as you can see, for a system with different quorum slice definitions, multiple quorums are possible.
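The quorum definition can be checked mechanically. The sketch below is an illustration under assumptions: the slide's table of quorum slices did not survive in this transcript, so the `SLICES` dictionary is a guess that is consistent with the quorums listed ({v1, v2}, {v3}, {v4} and the universe), and the helper names are made up. A set is a quorum if every member has at least one of its quorum slices fully inside the set.

```python
from itertools import combinations

# Assumed slice table (hypothetical reconstruction of the slide's example).
SLICES = {
    "v1": [{"v1", "v2"}],
    "v2": [{"v1", "v2"}, {"v2", "v3", "v4"}],
    "v3": [{"v3"}],
    "v4": [{"v4"}],
}

def is_quorum(nodes, slices) -> bool:
    """True if every node in `nodes` has some slice fully contained in `nodes`."""
    nodes = set(nodes)
    if not nodes:
        return False
    return all(any(s <= nodes for s in slices[v]) for v in nodes)

def all_quorums(slices) -> list:
    """Enumerate every quorum by brute force (fine for tiny examples only)."""
    universe = sorted(slices)
    found = []
    for k in range(1, len(universe) + 1):
        for combo in combinations(universe, k):
            if is_quorum(combo, slices):
                found.append(set(combo))
    return found

print(is_quorum({"v1", "v2"}, SLICES))  # v1 and v2 both have {v1, v2} inside
print(is_quorum({"v1"}, SLICES))        # v1's only slice needs v2 as well
print(len(all_quorums(SLICES)))
```

The brute-force enumeration is exponential in the number of nodes; it is used here only to make the definition concrete.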
Example of Quorums and Quorum Slices


Node Quorum Slices
1) We can have many quorums
{v1, v2}, {v3}, {v4}, {v1, v2, v3, v4}, ....


So, now we come to some basic requirements. We have defined what a quorum slice is and what a quorum is. Now let us see when you can guarantee consensus in this FBQS, Federated Byzantine Quorum System, setting. When is it that consensus can be guaranteed? This is by and large the most important requirement.


So, as I have said over here, you can have a lot of quorums. The requirement says: take any two quorums Q1 and Q2. All pairs of quorums have to have a non-empty intersection. Furthermore, one of the nodes within that intersection has to be correct. So, what is the idea? The idea is that you take any two quorums in the system and intersect them; the intersection has to be non-null.


And furthermore, in the intersection you need to have a correct node, which means they need to intersect at a correct node. We will discuss the insights later. But the most important point here is that the trickiest aspect of designing a Stellar network is to choose these quorums such that this condition holds. Furthermore, let us say I create quorums and there is an intersection of three nodes. How do I know that out of those three, at least one is correct? I may not have control over the runtime system; well, that concern is valid.


If I do not have control, of course, I will have to go to other algorithms of the kind that we have seen in Ethereum, Bitcoin and others, which do not have this restriction. But many a time, you would have some nodes which are more fault tolerant than others, for example, the machine of the system administrator and so on. And if we have maybe four nodes in the intersection, we have to look at probabilities: the probability of all four of them developing a Byzantine fault might be very, very low. This is something that we want to hedge against while designing our quorums.


So, what the correct node will do is the fun part. The correct node will essentially ensure that there is a common agreement, or consensus, across all quorums. I use the term quora here for the plural of quorum, but I should use quorums, because that is what we have been using. So, the key point is that there is one correct node in the intersection, and let us say one quorum Q has decided something. What is the advantage of a quorum? A quorum decides a consensus value. Now, every other quorum in the system has an intersection with it at a correct node.


Later on, let us say another quorum Q' tries to decide something which is the opposite of what Q decided. If Q' tries to decide something which goes against what Q has decided, then the correct node at the intersection is actually going to stop it. You can never rely on faulty nodes, but the correct node is going to stop it; it is going to say, look, a quorum has already decided something and you are going in the opposite direction, so do not go that way. The same holds for all the other quorums. So, that is why quorum intersection is important.
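The quorum intersection requirement can also be tested by brute force on small examples. In this sketch, both slice tables and all function names are assumptions made for illustration. `DISJOINT` has two single-node quorums that never meet, so the requirement fails; `TWO_OF_THREE` lets each node trust itself plus any one of the other two, so every two quorums meet.

```python
from itertools import combinations

def is_quorum(nodes, slices) -> bool:
    """Every member must have one of its slices fully inside the set."""
    nodes = set(nodes)
    return bool(nodes) and all(any(s <= nodes for s in slices[v]) for v in nodes)

def all_quorums(slices) -> list:
    """Brute-force enumeration of quorums (tiny examples only)."""
    universe = sorted(slices)
    return [set(c) for k in range(1, len(universe) + 1)
            for c in combinations(universe, k) if is_quorum(c, slices)]

def quorum_intersection_ok(slices, correct) -> bool:
    """True if every two quorums share at least one correct node."""
    quorums = all_quorums(slices)
    return all(q1 & q2 & correct for q1, q2 in combinations(quorums, 2))

# Hypothetical example 1: {v3} and {v4} are quorums with no common node.
DISJOINT = {"v3": [{"v3"}], "v4": [{"v4"}]}

# Hypothetical example 2: each node trusts itself plus any one other node.
TWO_OF_THREE = {
    "a": [{"a", "b"}, {"a", "c"}],
    "b": [{"a", "b"}, {"b", "c"}],
    "c": [{"a", "c"}, {"b", "c"}],
}

print(quorum_intersection_ok(DISJOINT, {"v3", "v4"}))
print(quorum_intersection_ok(TWO_OF_THREE, {"a", "b", "c"}))
print(quorum_intersection_ok(TWO_OF_THREE, {"a", "b"}))  # c is faulty
```

In the last call, the quorums {a, c} and {b, c} meet only at c, which is faulty, so nothing stops them from deciding conflicting values; this is exactly the situation the requirement rules out.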


So, in other words, all you need to do in this case is to convince a quorum to accept a value; then you are done. The quorum is small, and you do not have to broadcast the message to all the nodes; all the nodes will ultimately get it.

Few more definitions. Two nodes v1 and v2 are said to be intertwined if they are both correct; that is important. Whenever we are talking of nodes with some decision-making capability, both of them have to be correct; that is necessary. Furthermore, every quorum that contains v1 intersects every quorum that contains v2 in at least one correct node.


So, this is slightly specializing the definition to a pair of nodes. Two nodes are intertwined if, number one, they are both correct, and number two, every quorum that contains v1 intersects every quorum that contains v2 at a correct node. This is a specialization of the earlier definition, and we will see why it is being done like this.
Some Basic Requirements


* Every two quorums must intersect at a correct node.


This correct node will ensure that there is common agreement (consensus)
across all quora.


Few more definitions:


Two nodes, v1 and v2, are said to be intertwined, if they are both correct
and every quorum containing v1 intersects every quorum containing v2
in at least one correct node.
Let us now define an intact set. The definition of an intact set is quite crucial for the rest of the paper as well as for the proofs. So, let us start with a Federated Byzantine Quorum System (FBQS) S, and let us project it to a set I; let us see what that means. The projection is written as $S|_I$, and it is defined as $S|_I(v) = \{\, q \cap I \mid q \in S(v) \,\}$. So, for every vertex v, I take a look at all of its quorum slices; S(v) is basically the set of all the quorum slices of v. You consider each one at a time.


If q is an element of S(v), I only consider those elements of the quorum slice q that are also elements of I; basically, I compute the intersection between q and I. So, let us say this is my full system S, and I is the set on which I am projecting. I take a look at every vertex v, and for this vertex v, I take a look at all of its quorum slices. Let us say the vertex v is over here and these are its quorum slices; I only consider that part of the slices which lies within I. So, this is the projection. Now, a set I is an intact set if it satisfies two properties. The first is that I itself is a quorum in S.


So, what we do is that we take the Federated Byzantine Quorum System S, and in it, let us say, I choose one quorum, and that quorum is I. I call I an intact set if all pairs of vertices within I, let us say this vertex and this vertex, are intertwined. Go back to the previous slide to understand what intertwining means: first, both the nodes are correct. And they are intertwined in S projected to I, which basically means that there is some quorum slice of this vertex which lies completely within I, and there is some quorum slice of that vertex which lies completely within I.


And the fact that they are intertwined basically means that their quorums intersect at a correct node. Of course, the way that I have drawn it, they are not intersecting, but you get the idea. So, the idea is that my new universe is limited to the set I, and for every quorum slice, I only consider those elements that are a part of the set I. Then I will consider I to be an intact set if, first, I is a quorum in S. The second condition is about what happens if I project S to I, in the sense that I only look at the parts of the nodes and quorum slices of S that lie in I.


Then, any two nodes in I must be intertwined in this projection. So, this basically means that this is a self-contained set; that is the reason it is called an intact set. If I take any two nodes and take their quorums, then there is an intersection at a correct node, and of course, these lie completely within I. So, it is like a self-contained universe; this is an intact set.
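The projection and the two intact-set conditions can be put together in a short sketch. It follows the lecture's wording, where the projected universe is limited to the set I; the slice table and all function names are assumptions made for this example, and the brute-force enumeration only suits tiny systems.

```python
from itertools import combinations

def is_quorum(nodes, slices) -> bool:
    nodes = set(nodes)
    return bool(nodes) and all(any(q <= nodes for q in slices[v]) for v in nodes)

def all_quorums(slices) -> list:
    universe = sorted(slices)
    return [set(c) for k in range(1, len(universe) + 1)
            for c in combinations(universe, k) if is_quorum(c, slices)]

def project(slices, I) -> dict:
    """S|_I(v) = { q ∩ I : q in S(v) }, with the universe limited to I."""
    I = set(I)
    return {v: [set(q) & I for q in slices[v]] for v in I}

def intertwined(v1, v2, slices, correct) -> bool:
    """Both correct, and their quorums pairwise meet at a correct node."""
    if v1 not in correct or v2 not in correct:
        return False
    quorums = all_quorums(slices)
    return all(q1 & q2 & correct
               for q1 in quorums if v1 in q1
               for q2 in quorums if v2 in q2)

def is_intact(I, slices, correct) -> bool:
    I = set(I)
    if not is_quorum(I, slices):          # condition 1: I is a quorum in S
        return False
    projected = project(slices, I)        # condition 2: pairwise intertwined in S|_I
    return all(intertwined(a, b, projected, correct)
               for a, b in combinations(sorted(I), 2))

# Hypothetical system: each node trusts itself plus any one of the other two.
TWO_OF_THREE = {
    "a": [{"a", "b"}, {"a", "c"}],
    "b": [{"a", "b"}, {"b", "c"}],
    "c": [{"a", "c"}, {"b", "c"}],
}

print(is_intact({"a", "b", "c"}, TWO_OF_THREE, {"a", "b", "c"}))
print(is_intact({"a", "b"}, TWO_OF_THREE, {"a", "b"}))  # c is faulty
```

The second call shows why the projection matters: {a, b} is a quorum in the full system, but after projecting, a's slice {a, c} shrinks to {a} and b's slice {b, c} shrinks to {b}, so {a} and {b} become disjoint quorums and a and b are no longer intertwined.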
So, let us understand a little bit more and look at a few theorems with regard to intact sets. Why do we look at intact sets in the first place? The reason is basically that an intact set can reach a consensus, independently, using the Stellar algorithm. And of course, if the entire network is an intact set, then the entire network can reach a consensus. So, the first theorem is like this. Let us consider a quorum U1 and a quorum U2, and let us say both of them intersect an intact set I.


If that is the case, then $(U_1 \cap U_2) \cap I \neq \emptyset$. This is an important result that is used in the proofs of the Stellar protocol. I would encourage all of you to take a look at the papers and understand the proof of the protocol.


But the basic idea is that if there are two quorums which intersect the intact set, then the intersection of those quorums also intersects the intact set; so that intersection is not empty. Furthermore, let us take two intact sets; for instance, this is one intact set, this is one more intact set, and let us say that they are intersecting.


So, they are not separate; they are intersecting. Then, they are closed under union, which basically means that if there is a common node between them, then the union of the intact sets is also an intact set. So, this basically means that in any kind of network, unless a part is totally disconnected, if you have intact sets of this type, then essentially all of this is one intact set, and all of it can reach a consensus. So, basically, the only intact sets that will not reach a consensus along with the rest of the intact sets are those that are totally cut off. What does it mean to be totally cut off? It means that the set does not have any intersection with the rest of the intact sets.


So, that is when this is totally cut off. Because, what we also get to see is that by the first theorem 
over here, let us assume that there is some other intact set like this, such that (/1 N/2)#o # > (/ 
1 U /2). So, you know that part is clear that these two have to intersect, and if they intersect, the 


intact set will further grow. 


So, that is the reason either you have a separate kind of disconnected component over here, which 
does not intersect any intact set. So, so then, of course, it is separate, otherwise, if there is any 
intersection, you can consider this to be a larger intact set. And that can independently reach 


consensus as per the stellar algorithm which is something that we will see. 
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Let us understand a little bit more. 
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So, let us now look at the idea of non-blocking consensus. So, I will discuss what is non-blocking 
in a moment, but basically we are in the presence of nodes that could have Byzantine failures. So, let us look at this for an intact set, given the fact that we have discussed this union and intersection business over here.

Let us consider a maximal intact set I, where clearly we cannot grow it further, in the sense that there is no other intact set with an intersection such that we can grow it. So, the scope of growing it is not there; it is a maximal intact set. So, what we say is that these four properties will be satisfied by Stellar, and in fact should be satisfied by any consensus algorithm that deals with intact sets constructed in this manner.


So, what are the properties? The first is integrity. Integrity basically means that no correct node decides twice. This follows from the definition of consensus: if you are a correct node, once you have said that this is the consensus value, that remains; you do not change your mind. Agreement again follows from the definition of consensus; it basically means that no two nodes decide differently. Then we have weak validity, which basically means that if all the nodes are honest, then the value that is decided is one of the proposed values.

So, this again follows from consensus. What does not follow from it, but what technically makes sense, is the fourth property, which is the non-blocking part: let us say there are no malicious nodes, or all the malicious nodes just stop. Then, the protocol is guaranteed to terminate, because in this case this will not go against the FLP result. Because, what does the FLP result say? Roughly, that in an asynchronous system we cannot guarantee termination if even a single node can be faulty. But, the point is that in this case, if the malicious and faulty nodes are not active, you should be in a position to guarantee termination of the consensus algorithm.

So, the first three kind of follow from consensus, and the fourth one follows from the fact that you would like to ensure a minimum amount of liveness in the system. So, we have these four properties, which are the consensus guarantees you would like to have at least for maximal intact sets.
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So, the key part of the algorithm is federated voting, which draws inspiration from two phase 
commit. So in this case, we have two phases, called vote and deliver; later on, we will see that they are also called prepare and commit, it does not matter. There are two phases. So, I would like to request the viewers to look at two-phase commit, and also at the Paxos algorithm, because that also has two phases. And it is quite similar to this part, where you vote and you deliver. So, for the correct nodes, federated voting would like to ensure these four properties; this is also known as reliable Byzantine voting.


So, what does reliable voting basically mean? Mind you, voting is not consensus. Voting is a process which provides some safety guarantees and very few liveness guarantees, but let us nonetheless see what it is. So, the idea is that nodes vote for a value. And let us say a quorum successfully votes for it; then it is delivered. The first property is no duplication: every correct node delivers at most one voted value. So basically, as I said, voting is the first phase and delivery is the second phase. So, in the second phase, you will deliver at most one voted value, not more than that.

Then, you have totality: if a node in I, where I is an intact set, delivers a voted value, every node in I delivers a voted value. So, basically what does that mean? This is a liveness guarantee, which basically says that if one of the correct nodes in an intact set terminates and delivers a voted value, then all the other nodes will also terminate. So, this is a liveness guarantee; the first one is of course a safety guarantee. Then, consistency basically concerns two intertwined nodes, which is like any two nodes in an intact set, because all pairs there are intertwined.

If they deliver a and a' respectively, then a is equal to a', in the sense that they deliver the same value. So, this is just a sophisticated way of saying that all the nodes in an intact set deliver the same value. I have kind of paraphrased what is there in the paper: if two intertwined nodes deliver a and a', then a is equal to a'. So, which basically means that once a node delivers, everybody delivers, and everybody delivers the same value; that would be the sum total of properties 2 and 3. So, this is what you would like your federated voting algorithm to ultimately satisfy, and then of course, we have validity.

So, validity is again a safety condition, the last one, which essentially says that you deliver only what was voted for. You do not deliver something else. So, if you look at it, no duplication said that there is only one value: once a node delivers, it cannot vote for any other value, nor can it deliver any other value. So, that was a safety condition, and the last one, as we have just seen, is also a safety condition.

And the third one is also a safety condition, basically because it says that all the intertwined nodes in an intact set, which is all the nodes, deliver the same value. So, the only liveness condition is totality: if one node delivers, everybody delivers. So, what we see over here is that federated voting again takes some aspects of consensus, though of course it has a different liveness condition as compared to the earlier slide. But, the idea basically is that the earlier slide was consensus for intact sets, and this slide gives the properties of federated voting, which will ultimately take us towards consensus for intact sets. Federated voting as such is a step. So, FV is a step, and FV will take us towards consensus.


So then, what do we have now? Let us look at the federated voting protocol. As I have said, it is a two-phase protocol, where first you vote and then you deliver. But again, you are guaranteed to deliver only a single value, and the entire intact set either delivers or does not deliver. So, you have this liveness guarantee, and if it delivers, it delivers the same value; otherwise, it does not deliver at all.

And of course, in the rare case where everybody votes for the same value, it is only that value that gets delivered. So, the way that you need to think about it is that it is a two-phase algorithm of vote and deliver, where the voting process may actually not complete successfully. If that is the case, nobody delivers. But if it completes successfully, then everybody is ultimately guaranteed to deliver the same value.
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So, this is essentially the crux of federated voting, which the algorithm on this page will convince 
you that it works. So, what we have is a process for $(v \in V, t \in \mathit{Tag})$, where the tag essentially indicates a round, because the way that the master protocol works is that we have many rounds of voting. We do not have just one round of voting, we have many rounds of voting, and the tag indicates the round of voting. Then, we have the state variables voted, ready and delivered, which all start with false. So, let us start with the vote. Of course, if any node wants to vote for a value, it can initiate, or we will see that there are other conditions also.

But, at least let us say that a node decides to initiate, that it needs to vote for a certain value. For that specific round, for that specific tag actually, if it has not voted, then it will vote, and only once. So basically, I will use tag and round interchangeably. So, for a given tag or round, you can vote only once. The moment you vote, the variable voted becomes true. Then, you send the VOTE message to every vertex in the vertex set, including yourself. And what would that include? That would include the tag and the value that you voted for.

And mind you, you are allowed to do this only once, not multiple times, because of this variable over here. So, consider the moment node v receives a VOTE(t, a) message. Here, node v is always the node that is processing, either sending or receiving. What matters is the moment it receives a VOTE(t, a) message from every node of a quorum that it is a part of. So, here the idea of a quorum is coming in; you can go several slides back. But, the basic idea of a quorum was that for every node in the quorum, at least one of its quorum slices lies entirely within the quorum.

So, let us say a quorum votes for a, and of course we have quorum intersection and so on. If I am not ready, so again we have a ready variable and not ready means ready is false, I set ready to true, and then I send a READY message to every node including myself. So, go back to two-phase commit. In two-phase commit and Paxos, we were doing something quite similar. So, in the first phase you send VOTE messages, and once you get these VOTE messages from a quorum, you begin the second phase and send a READY message to everybody, including yourself.

Here is also a fun part: the READY message is also sent in a second case, when a v-blocking set sends it. I will describe what a v-blocking set is. So, let us say this is vertex v, and these are its quorum slices. A v-blocking set is a set which overlaps with every quorum slice of v.

So, a v-blocking set overlaps with every quorum slice of v, in the sense that it has a non-empty intersection with each of them. So, if every node of such a set sends a READY message to me, and I am vertex v, then what do I know? What I know is that in each of my quorum slices, at least one of the nodes has changed its status from not ready to ready, and it will not change its status ever again.

So, from not ready it has become ready, which means it is done. It has made up its mind; it has voted and it has also changed its status. So, then whatever READY message it sends, mind you, may be different from what I originally voted for. The value a may be different, but that is okay. It means that whatever I originally voted for did not find adequate support. But now, given a v-blocking set, which means pretty much one entry from every quorum slice of mine, if that has committed, it has changed its state from not ready to ready. So, that will remain; it is kind of final.
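The definition of a v-blocking set is short enough to write down directly. The following is a small Python sketch under assumed names: the function `is_v_blocking` and the example slice table are hypothetical, not taken from the slides.

```python
def is_v_blocking(B, v, slices):
    """B is v-blocking if it intersects every quorum slice of v."""
    B = set(B)
    return all(B & s for s in slices[v])


# Hypothetical node v with three quorum slices.
SLICES = {"v": [{"v", "a", "b"}, {"v", "b", "c"}, {"v", "c", "d"}]}

# b hits the first two slices and c hits the last two, so {b, c} is v-blocking.
print(is_v_blocking({"b", "c"}, "v", SLICES))   # True
# Neither a nor b lies in the third slice, so {a, b} is not v-blocking.
print(is_v_blocking({"a", "b"}, "v", SLICES))   # False
```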


So, what I will do is check whether I am not ready, because I change my state only once. For a given round, I change my state from not ready to ready only once. So, if I am not ready, I become ready, and then again I broadcast the READY message to every vertex. So, what you see over here is that a READY message can be propagated. Of course, the first node to send a READY message has to receive votes from a quorum, so that is given. But after that, a READY message can be propagated if a v-blocking set has sent the same READY message to a node v. And in this case, of course, it can be different from the value v voted for, but v will still propagate it.

And then what happens is that if node v receives READY messages from a quorum, which means a full quorum is sending a READY message, again similar to two-phase commit, and if it has not delivered, then it will set delivered to true and deliver the value. So, this is basically what delivery in an intact set means: the value has passed both the phases. So, the first phase is not guaranteed to terminate, mainly because it is possible that an entire quorum may not agree; you may have a 50-50 situation or even worse when multiple values are being proposed. So, that is why I said the first phase is not really guaranteed to terminate.

And we will see timeout mechanisms for that, but assuming it does terminate, you can guarantee beautiful things. Number one, you can guarantee that the same value is delivered. You can guarantee everything that federated voting promises. No duplication: you can clearly see that you deliver only one voted value, because your state changes from not delivered to delivered only once. If a node in I delivers a voted value, then it is guaranteed that it has the support of a quorum. So, no other value can be delivered, because of quorum intersection.

So, you take any two quorums, and you will have one correct node that is in common, and that cannot lie. So, given that it has already committed to a given value, it will keep that commitment, so nobody else can deliver anything else. So, that proves property three as well. And furthermore, if all the nodes vote for a single value, then that has to be delivered, because if you see our algorithm, there is no way to introduce any other spurious or extraneous value in the middle. So, this is like a two-phase commit where, as I said, the first phase need not finish, but if it does, then you have these beautiful safety as well as liveness properties of federated voting.
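The vote, ready and deliver steps described above can be put together in a small simulation. This is only a sketch under simplifying assumptions that are not in the lecture: all nodes are correct, the network is a reliable FIFO queue, there is a single tag, and the class name `FederatedVoting`, the `run` helper and the four-node slice table are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


class FederatedVoting:
    """One federated-voting process for node v and a single tag (round)."""

    def __init__(self, v, slices, network):
        self.v, self.slices, self.network = v, slices, network
        self.voted = self.ready = self.delivered = False
        self.delivered_value = None
        self.votes = defaultdict(set)    # value -> nodes that sent VOTE
        self.readys = defaultdict(set)   # value -> nodes that sent READY

    def _broadcast(self, kind, a):
        for u in self.slices:            # every vertex, including ourselves
            self.network.append((kind, self.v, u, a))

    def vote(self, a):
        if not self.voted:               # at most one vote per tag
            self.voted = True
            self._broadcast("VOTE", a)

    def _has_quorum(self, senders):
        # Shrink to the largest quorum inside `senders`, then check v is in it.
        U = set(senders)
        while True:
            keep = {u for u in U if any(s <= U for s in self.slices[u])}
            if keep == U:
                break
            U = keep
        return self.v in U

    def _is_blocking(self, senders):
        return all(senders & s for s in self.slices[self.v])

    def receive(self, kind, sender, a):
        (self.votes if kind == "VOTE" else self.readys)[a].add(sender)
        # READY once: after a quorum of votes, or a v-blocking set of READYs.
        if not self.ready and (self._has_quorum(self.votes[a])
                               or self._is_blocking(self.readys[a])):
            self.ready = True
            self._broadcast("READY", a)
        # Deliver once: after READY from a full quorum that contains v.
        if not self.delivered and self._has_quorum(self.readys[a]):
            self.delivered, self.delivered_value = True, a


def run(slices, proposals):
    """Let every node vote, then drain the message queue in FIFO order."""
    network = []
    procs = {v: FederatedVoting(v, slices, network) for v in slices}
    for v, a in proposals.items():
        procs[v].vote(a)
    while network:
        kind, sender, receiver, a = network.pop(0)
        procs[receiver].receive(kind, sender, a)
    return {v: p.delivered_value for v, p in procs.items()}


# Hypothetical system: each node's slices are all 3-node sets containing it.
NODES = ["a", "b", "c", "d"]
SLICES = {v: [set(c) for c in combinations(NODES, 3) if v in c] for v in NODES}

print(run(SLICES, {"a": "x", "b": "x", "c": "x", "d": "y"}))
print(run(SLICES, {"a": "x", "b": "x", "c": "y", "d": "y"}))
```

In the first run node d voted for y, but it still becomes ready for x because a v-blocking set sent READY for x, and all four nodes deliver x. In the second run no value gathers a quorum of votes, so nobody delivers, which is the stuck first phase mentioned above.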
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Now, let us look at some of the insights. So, the insights anyway, we have discussed, but still I am 
repeating them. The first READY message is sent only after the same VOTE message, that is, the VOTE message for the same value, is received from an entire quorum. So, only when a full quorum sends a VOTE message for the same value and a vertex v receives all of that does it initiate the READY message. And then of course, it is propagated by v-blocking sets, via this line over here. Termination is not guaranteed, because if it were, that would violate the FLP result.

And in particular, the first phase is the one that primarily causes the problems. However, once a value is delivered, what does that mean? What that means is that a full quorum has voted for that value, and it is not going to vote for any other value. So, that is being ensured with this line. So, if it is not going to vote for any other value, then no other value can be delivered. So, we are at least assured that it is only this value which is going to be delivered.

And all the correct nodes will get the delivery in finite time if we assume bounded clock skew. So, we assume that we have a partially synchronous system and that all the nodes are active, in the sense of executing this algorithm. Then, within a finite amount of time, which can also be made bounded, if the value is delivered at one node, it will be delivered at all the nodes in the intact set.
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So, let us now take this to the level of a ballot. So, let us now create a consensus protocol or rather 
a blockchain out of this. So, what we want is to use federated voting as the primitive and then build on top of it. So, let us introduce the notion of a ballot, where a ballot is a tuple $\langle n, x \rangle$: n is a positive number which keeps monotonically increasing, so this is the round number, the tag that we discussed, and x is the value that is being voted upon. So, we say that $b < b'$ if either $b.n < b'.n$, or the numbers are the same and $b.x < b'.x$.

So, the value is used as a tiebreaker only when the numbers are the same; otherwise we go by the number. Two ballots are said to be compatible if their values are the same, so that makes sense; we write $b \sim b'$ for compatibility. And two ballots are not compatible, written $b \not\sim b'$, if the values are not the same. So, for the term that we will use, assume that $b \le b'$, following the same notion of equality and less-than that we have over here. So, if we have this and the ballots are incompatible, we write $b \lnsim b'$.

So, hopefully this is clear: if the ballots are not compatible with each other, and one is less than or equal to the other, then we write it in this fashion. Actually, we can write strictly less than as well, because if the ballots are incompatible we will never have strict equality. We could have equality in terms of the numbers, but then the values will not be the same; that is what is being captured with the less than or equal to over here. But nonetheless, the term is clear: the moment we have something like this, it is essentially indicating an incompatibility, and along with that it is also indicating that one ballot is less than the other, in terms of the ordering that we have defined over here.
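The ballot ordering and the compatibility relations can be sketched in a few lines of Python. The names `Ballot`, `compatible` and `lower_and_incompatible` are hypothetical; the sketch assumes values are strings, so that tuple comparison gives exactly the order described above (round number first, value as the tiebreaker).

```python
from typing import NamedTuple


class Ballot(NamedTuple):
    n: int   # round number (the tag)
    x: str   # value being voted upon


def compatible(b, c):
    """Two ballots are compatible when they carry the same value."""
    return b.x == c.x


def lower_and_incompatible(b, c):
    """True when b is below c in the ballot order and carries a different value."""
    return b < c and not compatible(b, c)


# NamedTuple comparison is lexicographic on (n, x).
print(Ballot(1, "z") < Ballot(2, "a"))                       # True: lower round wins
print(Ballot(2, "a") < Ballot(2, "b"))                       # True: value breaks the tie
print(lower_and_incompatible(Ballot(1, "a"), Ballot(2, "a")))  # False: same value
print(lower_and_incompatible(Ballot(1, "b"), Ballot(2, "a")))  # True
```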
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So, now let us discuss an abstract consensus algorithm. The reason I am using the term abstract is 
basically because it assumes infinite resources, which is not realistic. So, we will later discuss a version of this algorithm that uses finite resources. But, at the moment, let us proceed with the version that uses infinite resources. So, let us say that specific to each ballot, we have a federated voting process. Given the fact that we do not place any bounds on this array, in principle it could be infinite; that is why I say it is an abstract algorithm, or a theoretical algorithm. But, very soon we will place bounds on it.

So, ballots is an array of federated voting processes, one per ballot. We maintain two state variables, candidate and prepared, and initialize them to 0 and null, that is, the ballot $\langle 0, \bot \rangle$. And of course, the first round is round 0, and this keeps on increasing monotonically. So here again, we will have two phases; we have taken a lot of inspiration from the two-phase voting protocols.

So, here also we will have two phases in the ballot algorithm, which is the consensus algorithm. We start with a candidate. The candidate is basically a tuple of round 1 and the value x that is being proposed. So, here is what we need to do.

So, here is the fun part, where federated voting is used as a function. For all the ballots b' which are less than and incompatible with candidate, we have to ensure that everybody agrees on false, so you vote false. Voting false basically means that all of those ballots are voted to be false, in the sense that everybody agrees that we will not accept them; they will not give our consensus value.

So, you are essentially voting false on them. This basically means that whatever ballots are not compatible with your candidate and have a lower number, you first invalidate all of them, and only then do we try to validate this candidate; otherwise we do not. So, this also makes sense.

So essentially, you are cutting out the possibility of any ballot which is less than and incompatible with yours. If it is compatible, you do not care, because it has the same value. But if it is incompatible, you are cutting out the chances of it ever being chosen as the consensus candidate, because we are waiting for all of them to be voted false.


Let us now look at the second part. So now, what have we done? We have said that all the ballots which are less than and not compatible with the current candidate have to be invalidated first. So, of course, here I am changing the notation a little.

So, b is always our current ballot and b' is always the other one. In that sense, it still remains the same, but of course it should be clear that these are separate handlers. So, consider the point when, for every b' which is lower than and incompatible with b, its federated voting process delivers false, which means that b' has been successfully invalidated.

So again, as I said, this process is not guaranteed to finish. Because, if it were guaranteed, we would have used this as our consensus algorithm. But, suppose all the lower and incompatible ballots are successfully invalidated, and we have not yet prepared a ballot which is as high as the current one.

So, that is also important, that prepared is not as high as the current one, b. Then, what we do is set prepared to b; this is the same logic as if not voted, then vote. So, b in this case is basically a ballot such that, for everything lower than and incompatible with it, we have received false. We set prepared to b because that is the largest ballot that we have prepared. So, what does preparing mean in this case? Preparing means that all the ballots which are lower and incompatible, all the b's, have been invalidated.

So, once we have such a b, we set prepared to it if it is the largest such ballot, and then we look at the relationship between candidate and prepared. As long as we have prepared a ballot which is at least equal to the candidate or greater than it, we are fine. Then, it is a valid candidate, which means the candidate is good to go.

Because, if this is the level of the candidate's round and this is the level of prepared, and everything incompatible below prepared has been invalidated, then it also means that everything incompatible below the candidate has been invalidated, and basically the candidate is good to go.

So, we set candidate to prepared. Here you are assuming that the same value is being proposed, so candidate and prepared are compatible. And the reason that I say that is basically because the same node will only propose a single value in this case. And if everything below prepared is invalidated, the same holds below candidate. So, we set candidate to prepared, in the sense that we kind of boost its rank up: we take the value of prepared and set that as candidate.
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And then I set ballots candidate dot vote true. So, what does this mean? So, what this basically 
means is that once I am sure that all lower and incompatible ballots are invalidated, I try to convince everybody: look, here is a candidate, and you vote true for it. So, this kind of begins the second phase.

So, when the ballot has delivered true, in the sense that all of them agree on the candidate, we decide on it and this finishes the consensus; consensus has been achieved. Once one true message is delivered, we know from the earlier theorems that the rest will also be delivered, because of the property of intact sets.

So, we do not really have to wait for all the deliver messages to reach their destinations; the moment we get one, we declare victory and say that consensus has been achieved. So, as you can see, we use federated voting, and even the top-level algorithm is built out of federated voting.

So essentially, it is basically a two-level hierarchical federated voting, where our aim is to cancel out all the ballots with a lower number which are incompatible. Once they are all cancelled out, we say: look, here is the candidate. So first we vote false, which is the cancelling process, and then we vote true, which basically means that now all of you agree on this candidate. So, as I said, the process of federated voting is not guaranteed to complete, which basically means that the first phase could get stuck.
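The two phases (first vote false on the lower incompatible ballots, then vote true on the candidate) can be sketched as a small state machine. This is a simplified illustration, not the algorithm from the paper: the per-ballot federated voting processes are replaced by the callbacks `on_deliver_false` and `on_deliver_true`, the value domain is assumed finite so that the lower ballots can be enumerated, the prepare check is only run for the current candidate, and all names are hypothetical.

```python
from typing import NamedTuple


class Ballot(NamedTuple):
    n: int
    x: str


NULL = Ballot(0, "")   # stands for the initial ballot <0, null>


def lower_incompatible(b, values):
    """All ballots below b (rounds 1..b.n) that carry a different value."""
    return [Ballot(n, x) for n in range(1, b.n + 1) for x in values
            if x != b.x and Ballot(n, x) < b]


class AbstractConsensus:
    def __init__(self, values):
        self.values = values              # assumed finite value domain
        self.candidate = self.prepared = NULL
        self.voted = {}                   # ballot -> the one vote we cast on it
        self.aborted = set()              # ballots whose process delivered false
        self.decided = None

    def propose(self, x):
        self.candidate = Ballot(1, x)
        for bp in lower_incompatible(self.candidate, self.values):
            self.voted.setdefault(bp, False)       # vote false, only once
        self._check_prepared(self.candidate)

    def on_deliver_false(self, bp):
        self.aborted.add(bp)
        self._check_prepared(self.candidate)

    def _check_prepared(self, b):
        lower = lower_incompatible(b, self.values)
        if self.prepared < b and all(bp in self.aborted for bp in lower):
            self.prepared = b
            if self.candidate <= self.prepared:
                self.candidate = self.prepared
                self.voted.setdefault(self.candidate, True)   # vote true

    def on_deliver_true(self, b):
        self.decided = b.x                # one delivery is enough to decide


node = AbstractConsensus(values=["x", "y"])
node.propose("y")                          # must first cancel ballot <1, x>
print(node.voted, node.prepared)
node.on_deliver_false(Ballot(1, "x"))      # <1, x> has been invalidated
print(node.voted, node.prepared)
node.on_deliver_true(Ballot(1, "y"))
print(node.decided)
```

Note that if the proposed value is the smallest one in the domain, there is nothing below the candidate to cancel in round 1, so in this sketch the node prepares and votes true immediately.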


So, that is why we need a timeout mechanism. So, here is what we do. Assume a node v and a quorum U where v is a part of the quorum. Let us say that for all nodes u in the quorum, so all nodes in a quorum that v is a part of, there exists a ballot $b_u$. So, $b_u$ is a ballot which is associated with u. And what is u? u is a node in the quorum U.

This is the quorum, this is a node u within it, and associated with it is a ballot $b_u$. So, there has to exist a ballot $b_u$ such that the current round is less than the round of the ballot, $b_u.n$. And either we have received from u a VOTE or READY message for this ballot with true, or we have received from u a VOTE or READY message with false for ballots b', where b' is defined as follows.

There is a half-open interval that ends at $b_u$: it starts at some lower ballot and contains the ballots which are less than $b_u$ from there on. It is open at the top, in the sense that it does not contain $b_u$ itself, but it contains all of these. So, each of these ballots is a b'. So, as long as for every b' which is a part of this interval we have received a false, in the sense that it is being invalidated, we realize that we are in trouble. So, why do you realize you are in trouble?

You realize you are in trouble because ballots exist whose round is higher than the current round of this node v, for every node u of a quorum that v is a part of. So, v has clearly seen other ballots whose round is quite high; that is the first point. And the other is that it has also received messages from those nodes, either to confirm that ballot or basically a prepare message for it, which means that for the ballots lower than it, it has received these prepare messages. So, VOTE and READY are part of that prepare step.

So, VOTE and READY are part of this federated voting process. As a part of this federated voting process, v has received messages for lower-numbered ballots from an interval below $b_u$, saying that look, all of these are false. So, either it has received a vote of true for $b_u$, or it has received votes of false for these, which basically means that there is some ballot which is active at a higher round. And all the other nodes in the quorum are at higher rounds. They have been sending messages and we have gotten those messages, which means that I am somehow left behind.

So, what is the sum total of this argument? The sum total is that I have been left behind; I am behind the other nodes. So, what I do now is update my round. I take all the nodes in the quorum and I take the minimum n value, which is the minimum round number in the ballots that they are sending.


And I set that to my current round, and I also start a timer; so this is the timeout mechanism that I was talking about. What happens is that, given that all the nodes are not making progress at the same rate, a node basically realizes that it has fallen behind, mainly because the round numbers that all the nodes in the quorum are using are higher than the value of its own round state variable. So, that is why it updates its round and starts the timer.
Consensus Algorithm - II

when ballots[b].delivered(true)
    decide(true)
Once one true message is delivered, we know that the rest will also be delivered (property of intact sets). Declare victory: consensus.

when there exists a quorum U (v is a part of it) s.t. for all u ∈ U
there is a ballot b_u (associated with u) s.t.
    1. round < b_u.n
    2. Either {VOTE, READY}(b_u, true) has been received from u, OR
       {VOTE, READY}(b', false) has been received from u for every b' ∈ [z_u, b_u), z_u < b_u
    round ← min {b_u.n | u ∈ U}
    start_timer
Upgrade the round to be in sync with the quorum.
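The catch-up rule can be sketched as a small function. This is a simplified illustration with hypothetical names: `evidence[u]` is assumed to already hold the round number $b_u.n$ of a ballot for which the required VOTE or READY messages from u have been received, so the message bookkeeping itself is not modelled, and the first qualifying quorum is used.

```python
def catch_up_round(v, round_now, quorums, evidence):
    """Return the round v should jump to, or None if no quorum is ahead of it.

    evidence maps a node u to the round b_u.n of a ballot for which
    qualifying VOTE/READY messages from u were received (assumed given).
    """
    for U in quorums:
        ahead = all(u in evidence and round_now < evidence[u] for u in U)
        if v in U and ahead:
            # Sync with the slowest member of the quorum; the caller
            # would start its timer at this point.
            return min(evidence[u] for u in U)
    return None


quorums = [{"a", "b", "c"}]
print(catch_up_round("a", 1, quorums, {"a": 3, "b": 5, "c": 4}))  # 3
print(catch_up_round("a", 1, quorums, {"a": 3, "b": 5}))          # None: nothing from c
print(catch_up_round("a", 3, quorums, {"a": 3, "b": 5, "c": 4}))  # None: a is not behind
```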
So, what happens on a timeout? You can see the paper; they have explained in detail what the timer means. When the timer expires, if prepared is uninitialized, you do something similar to the propose function, which I will show you. You increment the round to round + 1 and start with the candidate's value. Otherwise, if you have already prepared, then, as I said, it is never the case that you propose different values.

So, whatever value you prepared with, prepared.x, you just increment the round number and set that as the current candidate. Then, as you can see, the second line is exactly the same as the line in the propose function: you go forward and invalidate all the lower-numbered incompatible ballots. So, if I were to summarize, here is the key idea: you take all the lower-numbered and incompatible ballots, and you invalidate them.

And then once that is done, you try to make everybody agree on the candidate; that is the idea. If for some reason the protocol gets stuck, which it can, because liveness is anyway an issue with federated voting, you just set a timer. The timer basically ensures that the system stabilizes, in the sense that all the messages that have been sent are received. After that, you just increment the round and start afresh.
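The timeout rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the lecture, not the Stellar implementation: the `Ballot` tuple, the `UNINITIALIZED` sentinel, and the function name `candidate_after_timeout` are assumptions made for the example.

```python
from typing import NamedTuple, Optional


class Ballot(NamedTuple):
    n: int             # round number
    x: Optional[str]   # proposed value; None stands in for "uninitialized" (assumption)


UNINITIALIZED = Ballot(0, None)


def candidate_after_timeout(round_no: int, candidate: Ballot, prepared: Ballot) -> Ballot:
    """Pick the ballot to retry with once the timer expires."""
    if prepared == UNINITIALIZED:
        # Nothing has been prepared yet: retry the node's own value in the next round.
        return Ballot(round_no + 1, candidate.x)
    # Something was prepared: keep that value and only bump the round number.
    return Ballot(round_no + 1, prepared.x)
```

In both branches the value never changes arbitrarily; only the round number moves forward, which matches the remark that a node never proposes different values once it has prepared one.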
Some Basic Properties

* If some node decided on a ballot, some node must have prepared it
* If a node has prepared a ballot, some node must have proposed the value
* All nodes in the intact set decide the same value (if the protocol terminates)
    * If quorum intersection holds, this is obvious.
* We need infinite state because the state of all the ballots and tags (rounds) has to be maintained.

Abstract Consensus Algorithm for node v

(b' ⋦ b means: b' is lower than and incompatible with b)

ballots ← array of FV processes (specific to each ballot)
candidate, prepared ← <0, ⊥>
round ← 0

propose(x)
    candidate ← <1, x>
    for all b' ⋦ candidate do ballots[b'].vote(false)
    (Ensure all incompatible ballots with a lower number are voted out)

when for every b' ⋦ b, ballots[b'].delivered(false) and prepared < b
    prepared ← b
    if candidate ≤ prepared
        candidate ← prepared
        ballots[candidate].vote(true)
    (Ensure that the candidate ballot is accepted by all)


So, here are some basic properties of this algorithm. If some node decided on a ballot, it means some node must have prepared it. That has to be the case because, after all, there has been a flow of information between the nodes. If you look at the algorithm, you cannot decide a value out of thin air.

So, if a ballot has been decided, it must have been prepared. Now, if a node has prepared a ballot, some node must have proposed the value. Again, this follows from the same argument: we have a continuous information flow, and someone must have proposed the value. All the nodes in the intact set decide the same value; this we have been discussing for quite some time now.

It is not possible for the nodes in the intact set to decide some other value, and this is a direct result of the federated voting protocol, because of quorums, quorum intersection, and so on. All quorums intersect in at least one correct node, and that correct node can only decide once for the same round.

I mean, it cannot prepare different values; that is not possible. If you take a look at this algorithm, the key point is quite clear: for any node, the information flow is maintained. I am not discussing the timeout etc. here, but given that quorum intersection holds, it is quite obvious that the single correct node in the intersection will not give conflicting messages to the two quorums that it is a part of.

Sadly, even though this algorithm works, we need an infinite amount of state, because the state of all the ballots and all the tags, or rounds, has to be maintained. That is not in the best interest of the algorithm designer, because it leads to an unbounded, possibly infinite, amount of state.
Finite Version of the Protocol

* The main idea is that we cannot maintain an unbounded amount of state: all tags (rounds) and all ballots
* We need to do some garbage collection (dynamic removal basically)
* We need to maintain a subset of all messages (and throw out messages that are beyond the range)

New Terminology

* 2 phase


So, we have to work on a finite version of the protocol. The idea is that we do not maintain an unbounded amount of state; instead, we keep minima and maxima and so on. So, we remove entries; there is some garbage collection. Also, we do not maintain all the messages and all the state; we maintain a subset of the full state, or a subset of the full set of messages.

Anything that we feel is beyond the range, in the sense that it need not be maintained or can be inferred, is removed. So again, we will have some new terminology. As I said, the terminology keeps changing, but thankfully it still remains a two-phase algorithm. Last time we had prepared and decide; in this case, we will have prepare and commit, which is the same thing. We have used many terms, such as deliver, and now it will be prepare and commit. Prepare and commit simply refer to the finite version of the algorithm; the concept is the same.
New FV Algorithm (finite)

(b' ⋦ b means: b' is lower than and incompatible with b)

prepare(b)
    if (max-voted-prep < b) then
        max-voted-prep ← b
        send VOTE (PREP max-voted-prep) to all the nodes

if there exists a maximum ballot b (> max-voted-prep) and if every node, u, in the quorum
(containing v) sends VOTE (PREP b_u) where b' ⋦ b ⟹ b' ⋦ b_u
    max-readied-prep ← b
    send READY (PREP max-readied-prep) to every node


So, the new FV algorithm, the new federated voting algorithm, which is finite, works like this. When you are preparing a ballot b, you look at the largest ballot that you have voted for. If it is less than b, then b is clearly a higher-numbered ballot, which is what you want, because this has to be monotonically increasing. So, if I have not voted for the current ballot or for a later one, then I set max-voted-prep to b, which is also what we were doing previously. In this case, what you are saying is: look, I will keep on voting, I do not have an issue.

But the thing is that I will only vote for ballots with monotonically increasing numbers, and then I will send the vote. The message that I send is VOTE (PREP max-voted-prep), where max-voted-prep is just what I voted for, which is basically the ballot that I proposed, and I send it to all the nodes. So, this part is the same; it is like sending your basic vote messages.

Now the next part. Suppose there exists a maximum ballot b (> max-voted-prep), and every node u in the quorum that contains v has sent VOTE (PREP b_u), where b_u has the following property: any b' that is lower than and incompatible with b is also lower than and incompatible with b_u.

So, if this is b, then b_u is somewhere such that everything that is lower and incompatible with b is also lower and incompatible with b_u. This basically means that everybody has voted either for the value in this ballot or for a ballot with a higher number, so you are fine.

So, at least you are sure that this ballot has been prepared correctly, in the sense that anything which is lower and incompatible with it is also lower and incompatible with the ballots that the other nodes have sent. In this case, you go for the ready step: you set max-readied-prep to b, and you send READY (PREP max-readied-prep) to every other node. The PREP part is the same, but this is the ready message.

So, what is the idea? Here, instead of maintaining a Boolean state as we were doing earlier, we are only increasing the numbers max-voted-prep and max-readied-prep. This is the same thing. It is like saying that for a given round, I will vote only once. In other words, that is tantamount to saying that I will monotonically increase the round number.

This means that if, say, my current round number is 10, I will not consider any other round number between 1 and 10, but only something which is higher. Furthermore, if I receive a value of b_u for which lower and incompatible ballots are still alive, then b clearly cannot be prepared. So, I will not proceed with this statement; I will wait.
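To make the "lower and incompatible" condition concrete, here is a small Python sketch. It is an illustration under stated assumptions, not the protocol's actual code: ballots are modelled as `(n, x)` tuples ordered first by round number and then by value, and the implication is checked by brute force over a finite list of ballots (`universe`), which is a simplification introduced only for this example. The names `lower_and_incompatible` and `vote_supports` are hypothetical.

```python
from typing import Iterable, NamedTuple


class Ballot(NamedTuple):
    n: int   # round number
    x: str   # value


def lower_and_incompatible(b1: Ballot, b2: Ballot) -> bool:
    """True if b1 is ordered below b2 and carries a different value."""
    return b1 < b2 and b1.x != b2.x   # tuple order: by n, then by x


def vote_supports(b: Ballot, b_u: Ballot, universe: Iterable[Ballot]) -> bool:
    """Check that every b' lower and incompatible with b is also lower and
    incompatible with b_u, for all b' drawn from the finite `universe`."""
    return all(
        lower_and_incompatible(bp, b_u)
        for bp in universe
        if lower_and_incompatible(bp, b)
    )


# A small finite set of ballots to check against (assumption for the demo).
universe = [Ballot(n, x) for n in range(1, 5) for x in ("a", "b")]
```

With this universe, a vote for `Ballot(3, "a")` supports preparing `Ballot(2, "a")` (same value, higher number), while a vote for `Ballot(3, "b")` does not, because `Ballot(1, "b")` is lower and incompatible with `Ballot(2, "a")` but compatible with `Ballot(3, "b")`.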
Finite Version of the FV Algorithm - II

(b' ⋦ b means: b' is lower than and incompatible with b)

if there exists a maximum ballot b (> max-readied-prep) and if every node, u, in a v-blocking
set sends READY (PREP b_u) where b' ⋦ b ⟹ b' ⋦ b_u
    max-readied-prep ← b
    send READY (PREP max-readied-prep) to every node

if there exists a maximum ballot b (> max-delivered-prep) and if every node, u, in the quorum
(containing v) sends READY (PREP b_u) where b' ⋦ b ⟹ b' ⋦ b_u
    max-delivered-prep ← b
    prepared (max-delivered-prep)




So now, the second part of the algorithm; let us look at this. What had we seen in the earlier, infinite version of the algorithm? Once we get a vote from a quorum, we send a ready message. Here also we are doing the same, and then we propagate the ready message when we get it from a v-blocking set.

So, here we are doing exactly the same; it is just that the format has changed a little bit. The idea is the same: you need a ballot b (> max-readied-prep), and every node u in a v-blocking set has sent READY (PREP b_u), where again this condition holds. We have seen this before: anything which is lower and incompatible with your stored maximum ballot b must also be lower and incompatible with b_u, which you have gotten from another node.

If that holds, you are fine. So, this is the same as getting a ready message from a v-blocking set, either for this ballot or for something which is newer. So, you can happily set max-readied-prep to b and just propagate the ready message.
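The two kinds of node sets used in these steps, quorums and v-blocking sets, can be sketched as follows. This is a minimal illustration using the usual federated-voting definitions (a quorum is a non-empty set in which every member has at least one of its slices inside the set; a set is v-blocking if it intersects every slice of v). The `Slices` type and the function names are assumptions made for the example.

```python
from typing import Dict, FrozenSet, List, Set

# node id -> list of that node's quorum slices (hypothetical representation)
Slices = Dict[str, List[FrozenSet[str]]]


def is_quorum(members: Set[str], slices: Slices) -> bool:
    """Non-empty set in which every member has at least one slice inside the set."""
    return bool(members) and all(
        any(s <= members for s in slices[v]) for v in members
    )


def is_v_blocking(blockers: Set[str], v: str, slices: Slices) -> bool:
    """A set is v-blocking if it intersects every slice of v."""
    return all(s & blockers for s in slices[v])


# Tiny example configuration (made up for illustration).
slices: Slices = {
    "a": [frozenset({"a", "b"}), frozenset({"a", "c"})],
    "b": [frozenset({"a", "b"})],
    "c": [frozenset({"b", "c"})],
}
```

Here `{"a", "b"}` is a quorum, while `{"a", "c"}` is not, because the only slice of `c` contains `b`. The set `{"b", "c"}` is a-blocking since it touches both slices of `a`; `{"b"}` alone is not.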


So, this is just the propagate function, where we send READY (PREP max-readied-prep) to every other node; that is, we are propagating the ready messages. After propagating, we again look at delivery. For delivery we have the same format: there is a maximum ballot b (> max-delivered-prep).

This is basically saying, again, that you deliver once per round. And if every node u in the quorum sends READY (PREP b_u), where again this condition holds, meaning the same ballot or a newer one, then we do the same: we set max-delivered-prep to b and we say that this has been fully prepared. We are not using the term delivered here; we are saying that this has been fully prepared, or equivalently that we are delivering the PREP message. So, the key idea is the same; we are not changing anything. In the earlier case, we had a Boolean variable.

The Boolean variable was just saying, for this tag, if you have not prepared, then prepare; if you have not delivered, then deliver. In this case, we are saying that the variables max-voted-prep, max-readied-prep, and max-delivered-prep all increase monotonically, which basically means that if something has happened for one round, it cannot happen again. It is the same thing. So, this is again a two-phase process with exactly the same restrictions and limitations. It is just that we maintain finite state.
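The replacement of one Boolean per ballot by a monotonically increasing variable can be sketched like this. It is a toy illustration, not the protocol code: the `Watermark` and `PrepareState` classes are names invented for the example, and the quorum and v-blocking checks that would decide when to advance each variable are left out.

```python
from typing import NamedTuple, Optional


class Ballot(NamedTuple):
    n: int   # round number
    x: str   # value


class Watermark:
    """Remembers only the largest ballot seen; advances strictly upward."""

    def __init__(self) -> None:
        self.value: Optional[Ballot] = None

    def advance(self, b: Ballot) -> bool:
        # Accept b only if it is strictly larger than everything accepted so far.
        if self.value is None or self.value < b:
            self.value = b
            return True
        return False


class PrepareState:
    """Three watermarks stand in for the per-ballot voted/readied/delivered flags."""

    def __init__(self) -> None:
        self.max_voted_prep = Watermark()
        self.max_readied_prep = Watermark()
        self.max_delivered_prep = Watermark()


state = PrepareState()
first = state.max_voted_prep.advance(Ballot(10, "a"))    # accepted
stale = state.max_voted_prep.advance(Ballot(7, "a"))     # lower round: ignored
repeat = state.max_voted_prep.advance(Ballot(10, "a"))   # same ballot: ignored
newer = state.max_voted_prep.advance(Ballot(11, "a"))    # accepted
```

The state kept per node is three ballots, regardless of how many rounds have passed, which is the point of the finite version.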
The Commit Function

commit (b):
    if b ∉ ballots_voted_cmt and max-voted-prep = b then
        ballots_voted_cmt ← ballots_voted_cmt ∪ {b}
        send VOTE (CMT b) to every node

when received VOTE (CMT b) from a quorum and b ∉ ballots_readied_cmt
    ballots_readied_cmt ← ballots_readied_cmt ∪ {b}
    send READY (CMT b) to every node


So, as we said, we will have two functions, prepare and commit. Prepare is again federated voting on the PREP message, and commit is basically the same. It has a very similar format; it is slightly simpler. Here the idea is: if the ballot b is not something that I have voted commit for, and max-voted-prep is equal to b, then I vote to commit it.

So, where is max-voted-prep set? It is set in the prepare function. So look, if this is what I have prepared, then I decide to commit it; I decide to go to the next step. Next, when I have received VOTE (CMT b) from a quorum and I have not readied the commit, I ready it. So, in the first case the check is that I have not voted, and in this case the check is that I have not readied it.

It is the same idea, but instead of a Boolean I am maintaining a set, because the point is that I could be committing and readying many ballots corresponding to different rounds. So again, in this case, if b is not a part of the set, I make it a part of it. That is why I have the union operation here. And once I have gotten the votes, I send a READY message; this is again the same. As you can see, it is the same pattern: you send VOTE messages, and once you get VOTE messages from a quorum, you send a READY message.
Commit Function - II

when received READY (CMT b) from a v-blocking set and b ∉ ballots_readied_cmt
    ballots_readied_cmt ← ballots_readied_cmt ∪ {b}
    send READY (CMT b) to every node


So, the commit function here says that when you receive READY (CMT b) from a v-blocking set, you propagate it. We have always had a function to propagate the ready messages. So, when we receive READY (CMT b) from a v-blocking set, same idea, and we have not already readied it, then we put it into the ballots_readied_cmt set. And we send a READY (CMT b) message to every node; we are just propagating the READY message. This part is again the same; we have seen it.

Then, we deliver the commit message when we get the READY message from a quorum and b is not a part of ballots_delivered_cmt. In this case, we add the ballot to ballots_delivered_cmt and we commit the message; that is, we finally deliver it. So, it is the end of the two-phase protocol. We commit the message the same way we did for prepare.
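The whole commit path (vote, ready, propagate, deliver) can be put together in a short Python sketch. It is a simplified illustration: deciding whether a message really came from a quorum or a v-blocking set is assumed to be done by the caller, sending is modelled by appending to an `outbox` list, and the class and method names are hypothetical.

```python
from typing import List, Set, Tuple

Ballot = Tuple[int, str]   # (round number, value)


class CommitState:
    def __init__(self, max_voted_prep: Ballot) -> None:
        self.max_voted_prep = max_voted_prep
        self.ballots_voted_cmt: Set[Ballot] = set()
        self.ballots_readied_cmt: Set[Ballot] = set()
        self.ballots_delivered_cmt: Set[Ballot] = set()
        self.outbox: List[Tuple[str, Ballot]] = []   # messages "sent to every node"
        self.committed: List[Ballot] = []

    def commit(self, b: Ballot) -> None:
        # Vote to commit b only once, and only if b is what we voted to prepare.
        if b not in self.ballots_voted_cmt and self.max_voted_prep == b:
            self.ballots_voted_cmt.add(b)
            self.outbox.append(("VOTE_CMT", b))

    def _ready(self, b: Ballot) -> None:
        if b not in self.ballots_readied_cmt:
            self.ballots_readied_cmt.add(b)
            self.outbox.append(("READY_CMT", b))

    def on_vote_from_quorum(self, b: Ballot) -> None:
        self._ready(b)   # a quorum voted: send READY

    def on_ready_from_v_blocking(self, b: Ballot) -> None:
        self._ready(b)   # propagate READY seen from a v-blocking set

    def on_ready_from_quorum(self, b: Ballot) -> None:
        # Deliver: the end of the two-phase protocol for this ballot.
        if b not in self.ballots_delivered_cmt:
            self.ballots_delivered_cmt.add(b)
            self.committed.append(b)
```

Sets are used instead of a single maximum here because, as noted above, several ballots from different rounds may be in the commit phase at the same time.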
Consensus Protocol

candidate, prepared, and round are initialized
P ← voting process

propose (x)
    candidate ← <1, x>
    P.prepare (candidate)

when triggered P.prepared (b) and prepared < b
    if candidate ≤ prepared
        candidate ← prepared
        P.commit (candidate)

When committed, decide (b.x)


So, now our job is much simpler. The basic consensus protocol, and again this is the finite version, actually becomes much simpler. We initialize candidate, prepared, and round the same way we were doing. We let P be the voting process. Propose(x) is exactly the same thing: we form a candidate and we prepare the candidate. Preparing the candidate basically means going back to federated voting on the PREP message for that candidate. But how do we ensure that everything lower than and incompatible with it has been invalidated?

Well, using the check in the prepare function: anything that is lower and incompatible with b is also lower and incompatible with b_u. That is how we ensure that the moment something is prepared, nothing lower and incompatible with it is still alive. So, once a ballot b has been prepared, we check whether prepared is less than that ballot.

Again, this is a standard check that we have had, to ensure that we prepare a ballot only once. Then, if the candidate is less than or equal to prepared, we set candidate to prepared. Again, we have seen this earlier. Then we commit the candidate, and when it is committed, we decide.

So, as you can see, this is pretty much the same as the algorithm with infinite resources. You prepare first. What does preparing mean? Take the ballot; anything lower and incompatible, you cancel, essentially making everyone vote false on all of them. Once that is done, once you have prepared something, it becomes our current candidate. Then you commit it, which basically means you make everybody agree on this value, make everyone vote true. If this message is delivered in our two-phase protocol, then you decide; otherwise, you go for a timeout mechanism.
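The propose, prepared, and committed steps fit together as in the following Python sketch. It is an illustration only: `RecordingVoting` is a stand-in for the voting process P that merely records calls, the `UNSET` ballot with an empty value is an assumed encoding of "uninitialized", and the line that stores `prepared = b` is carried over from the abstract (infinite-state) version as an assumption.

```python
from typing import List, NamedTuple, Optional, Tuple


class Ballot(NamedTuple):
    n: int   # round number
    x: str   # value


UNSET = Ballot(0, "")   # assumed encoding of the uninitialized ballot


class RecordingVoting:
    """Stand-in for the voting process P: it only records what was asked of it."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, Ballot]] = []

    def prepare(self, b: Ballot) -> None:
        self.calls.append(("prepare", b))

    def commit(self, b: Ballot) -> None:
        self.calls.append(("commit", b))


class Consensus:
    def __init__(self, voting: RecordingVoting) -> None:
        self.P = voting
        self.candidate = UNSET
        self.prepared = UNSET
        self.round = 0
        self.decided: Optional[str] = None

    def propose(self, x: str) -> None:
        self.candidate = Ballot(1, x)
        self.P.prepare(self.candidate)

    def on_prepared(self, b: Ballot) -> None:
        if self.prepared < b:              # react to each prepared ballot only once
            self.prepared = b              # bookkeeping as in the abstract version
            if self.candidate <= self.prepared:
                self.candidate = self.prepared
                self.P.commit(self.candidate)

    def on_committed(self, b: Ballot) -> None:
        self.decided = b.x                 # decide the value carried by the ballot
```

A typical run is `propose("a")`, then `on_prepared(Ballot(1, "a"))` once voting on the PREP message completes, then `on_committed(Ballot(1, "a"))`, after which `decided` holds `"a"`.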
Timeout

Both the algorithms are equivalent

* Same as the algorithm with infinite resources
* If a node receives messages for later rounds (from the entire quorum)
    * Set the round to the minimum value of b_u.n in the quorum.
    * Start a timer
* After a timeout
    * Increment the round and prepare again


And in the timeout mechanism, both the algorithms are equivalent; it is the same as the algorithm with infinite resources. If a node receives messages for later rounds from the entire quorum, it sets round to the minimum value of b_u.n in the quorum and starts a timer. Waiting for the timeout ensures that all the messages that have been sent are received.

Then you increment the round and prepare again for the later round. In the worst case, because liveness is not guaranteed, the rounds can keep on increasing forever, which is fine. The FLP result in any case says that if we have faulty processes, the algorithm may not converge. That is fine; it may go on forever.
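The round synchronisation rule can be sketched as a small Python function. This is a minimal illustration assuming the caller has already collected, for each node u of some quorum containing v, the ballot b_u it last heard from u; the function name `sync_round` and the dictionary representation are assumptions made for the example.

```python
from typing import Dict, NamedTuple, Optional


class Ballot(NamedTuple):
    n: int   # round number
    x: str   # value


def sync_round(round_no: int, quorum_ballots: Dict[str, Ballot]) -> Optional[int]:
    """Return the round to jump to if every quorum member is in a later round.

    Returns None when the node should stay in its current round.
    """
    if quorum_ballots and all(b.n > round_no for b in quorum_ballots.values()):
        # Catch up to the slowest member of the quorum, then start the timer.
        return min(b.n for b in quorum_ballots.values())
    return None
```

For example, a node in round 2 that hears ballots with rounds 5, 3, and 4 from a quorum moves to round 3; a node already in round 3 stays where it is, because not every member is strictly ahead of it.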
Lying about Quorum Slices

* We do not make any assumptions about faulty nodes
* As long as we have an intact set that guarantees a non-empty quorum intersection comprising correct nodes, there is no problem
* All these protocols are obstruction free
    * This means that if the faulty nodes stop
    * Consensus is achieved (subject to bounded clock skew)
So, now let us come to the issue of lying about quorum slices. Because we are considering Byzantine failures, nodes may lie about their quorum slices. We do not make any assumptions about faulty nodes; that is not required. As long as we have an intact set that guarantees a non-empty quorum intersection comprising correct nodes, there is no problem. Even if they lie, there is no issue; you will see a proof in the paper. All of these protocols are obstruction free. This means that if the faulty nodes stop, consensus can still be achieved, subject to bounded clock skew.
Key Lemmas in the Proof Sketch

1. If two nodes in an intact set send READY (t, a) and READY (t, a') messages, then a = a'.
2. If a node v1 commits a ballot b1, then the largest ballot b2 prepared by any other node in the same intact set (before the commit) is such that b1 ~ b2
3. All correct nodes in an intact set ultimately decide the same value (if the algorithm terminates)
4. It is a proposed value
Key lemmas in the proof sketch. So, how do you prove that this is correct? I will walk you through a few key lemmas; the proofs are all there in the papers. First, if two nodes in an intact set send READY (t, a) and READY (t, a') messages, then a = a'. We have already seen this; it follows from quorum intersection, because the single correct node in the intersection will not lie to the quorums that it is a part of. Second, if, say, node v1 commits a ballot b1, then the largest ballot b2 prepared by any other node in the same intact set before the commit is such that b1 ~ b2, that is, the two ballots are compatible.

This can be easily seen in the proof. It is ensured by the lower-and-incompatible check in the prepare function. It is this check which says that everybody has to have prepared this ballot, or a compatible ballot which is larger. The other node will have to pass through this stage as well, and as you can see, this check is being done everywhere.

So, it is not possible to bypass this check at any step. That is why it is not possible that something else will get prepared and move through all the stages once some node in the intact set has gone as far as commit. The next thing is that all the correct nodes in an intact set ultimately decide the same value, of course, if the algorithm terminates.

We have already seen this in different forms, so I will not describe it further. And needless to say, it is a proposed value. So, this again guarantees all the properties that a consensus algorithm is supposed to provide.
Conclusion

Some guarantees regarding the reliability of intersecting nodes and partial synchrony need to be made.


So, what is the conclusion? The conclusion is that we were able to give you a very different algorithm. But again, the algorithm is based on quorums and quorum slices, where, of course, in the intersection you need a correct node which is not going to lie. As long as this can be ensured, our protocol can be much faster, because all the nodes do not have to participate. Only a limited number of nodes need to participate until one quorum decides; I should be more precise: once at least one intact set decides, we are done.

Some guarantees regarding the reliability of nodes and partial synchrony need to be made. But those are not impractical guarantees; they are quite practical. And that is why the Stellar protocol is called an internet-level protocol, because you can really achieve consensus on a very large and wide scale.

So, these are the two papers. I have primarily referenced the commentary on the original paper, because I felt that it was much more readable. But the original paper is over here. You can also go to the Stellar website, download the code, and take a look at it. That would be quite interesting. This was a reasonably long lecture, but Stellar is also quite a complex protocol. It is also very new, and it is very fast and scalable.

So, it is solving one of the existing issues that Bitcoin and Ethereum had, which was very low throughput. Of course, this comes at the cost of additional assumptions, which at least to my mind are not very impractical. I would like to encourage all of you to go and take a look at the source code of Stellar.
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